qtzx06 committed on
Commit
eac9d9f
·
1 Parent(s): b0b9657

feat: finalize swarm tooling and submission artifacts

.gitignore CHANGED
@@ -5,6 +5,8 @@ __pycache__/
5
  build/
6
  dist/
7
  outputs/
8
  *.egg-info/
9
  *.pyc
10
-
 
5
  build/
6
  dist/
7
  outputs/
8
+ zero960_logs/
9
+ zero960_grpo_output/
10
+ zero960_grpo_final/
11
  *.egg-info/
12
  *.pyc
 
README.md CHANGED
@@ -1,48 +1,252 @@
1
  # 0x960
2
 
3
- 0x960 is an OpenEnv-oriented environment where a model improves a minimal Chess960 engine by editing a bounded evaluation file and getting rewarded by match outcomes.
4
 
5
- ## background
6
 
7
- Chess960 is a strong benchmark for generalization because the rules of chess stay the same while the starting position changes across 960 legal configurations. That removes much of the opening-book structure that standard chess systems can exploit and puts more pressure on transferable positional reasoning and search.
8
 
9
- Recent engine and research results make this useful for our setting. Classical search engines such as Stockfish remain extremely strong in Chess960, while several neural and RL-heavy systems lose more relative strength than they do in standard chess. Recent work also shows that transformer chess models trained on standard chess suffer noticeable drops on Chess960 positions, which suggests that high in-distribution performance can still rely on brittle configuration-specific pattern matching.
10
 
11
- 0x960 turns that observation into an OpenEnv task. Instead of asking a model to output chess moves directly, we ask it to improve a minimal Chess960 engine through bounded code edits. The model reads files, edits the evaluation logic, runs checks, and gets rewarded by whether the edited engine performs better against a baseline.
 
12
 
13
- ## why chess960
14
 
15
- - Chess960 is a controlled distribution shift: the rules are unchanged, but the initial conditions vary.
16
- - That makes it a cleaner test of robustness than standard chess alone.
17
- - The agent is not rewarded for imitation or move prediction; it is rewarded for improving a real system.
18
- - This makes the environment a better fit for OpenEnv than a direct gameplay benchmark because it requires multi-step tool use, debugging, and iterative refinement.
19
 
20
- ## repo layout
21
 
22
- - `docs/`: concept, architecture, and scope docs
23
- - `src/zero960/`: shared engine and episode runtime logic
24
- - `src/zero960_env/`: OpenEnv-facing models, server, and client
25
- - `train/`: minimal TRL/Colab-oriented training entrypoints
 
26
 
27
- ## current status
28
 
29
- This repo currently contains a thin but functional skeleton:
30
 
31
- - minimal Chess960 engine core
32
- - workspace-based bounded file editing runtime
33
- - OpenEnv wrapper scaffold
34
- - minimal TRL rollout stub
35
 
36
- ## next steps
37
 
38
- 1. tighten the engine and reward harness
39
- 2. validate the OpenEnv app structure against `0.2.1`
40
- 3. add a small Colab notebook around the training stub
41
- 4. deploy the server to HF Spaces
42
 
43
- ## supporting docs
44
 
45
- - `docs/why_chess960.md`: short research framing for judges and README reuse
46
- - `docs/demo-script.md`: one-minute demo outline
47
- - `docs/process.md`: chronological build log for demo storytelling and judging
48
- - `docs/agent-log-instruction.md`: reusable instruction snippet for coding agents
 
1
  # 0x960
2
 
3
+ 0x960 is an OpenEnv environment where a model improves a minimal Chess960 engine by editing a bounded `eval.py` file and getting rewarded by match outcomes.
4
 
5
+ The core task is not "play chess." The task is "act like a bounded engine engineer": inspect the current evaluation logic, edit it, test it, and decide when to finish.
6
 
7
+ ## Current Direction
8
 
9
+ The repo currently supports two training paths:
10
 
11
+ - teacher distillation first: collect high-quality bounded-action trajectories from Codex or another strong coding agent, then fine-tune a smaller open model on those traces
12
+ - RL refinement second: use the OpenEnv reward loop to sharpen a student model that already knows the `write_file -> run_match -> finish` workflow
13
 
14
+ This ordering is deliberate. The main failure mode so far has not been raw model size; it has been weak action priors. Base models tend to spam `run_static_eval` or `finish` instead of discovering code edits. Distillation fixes that faster than asking GRPO to invent the workflow from scratch.
15
 
16
+ There is also a complementary outer-loop path: use multiple local Codex workers to iterate directly on the engine, benchmark every patch, keep only Elo-positive changes, and then distill the best traces back into an open student. See [Codex Swarm Plan](docs/codex-swarm-plan.md).
 
 
 
17
 
18
+ ## Repo Layout
19
 
20
+ - `src/zero960/`: engine, workspace, and episode runtime
21
+ - `src/zero960_env/`: OpenEnv server, models, and client
22
+ - `train/minimal_trl_openenv.py`: handcrafted demo, inference loop, and GRPO training entrypoint
23
+ - `train/codex_distill.py`: Codex teacher rollout collector and SFT sample exporter
24
+ - `docs/`: concise project docs and process log
25
 
26
+ ## Local Smoke Test
27
 
28
+ Start the OpenEnv server:
29
 
30
+ ```sh
31
+ uv run python -m uvicorn zero960_env.server.app:app --host 127.0.0.1 --port 8000
32
+ ```
 
33
 
34
+ Run the bounded-action demo:
35
 
36
+ ```sh
37
+ uv run python -m train.minimal_trl_openenv --mode handcrafted --base-url http://127.0.0.1:8000
38
+ ```
 
39
 
40
+ ## Codex Teacher Distillation
41
 
42
+ Prerequisites:
43
+
44
+ - Codex CLI installed and logged in
45
+ - local OpenEnv server running
46
+
47
+ Collect teacher rollouts and export SFT-ready samples:
48
+
49
+ ```sh
50
+ uv run python -m train.codex_distill \
51
+ --base-url http://127.0.0.1:8000 \
52
+ --model gpt-5.4 \
53
+ --episodes 20
54
+ ```
55
+
56
+ Outputs go to `outputs/codex_distill/`:
57
+
58
+ - `teacher_rollouts_*.jsonl`: raw per-episode teacher traces
59
+ - `sft_samples_*.jsonl`: filtered turn-level chat samples for student fine-tuning
60
+
61
+ ## Student SFT
62
+
63
+ Train a small student on the collected teacher traces:
64
+
65
+ ```sh
66
+ uv run python -m train.sft_student \
67
+ --model Qwen/Qwen3.5-0.8B \
68
+ --output-dir outputs/sft_qwen_0p8b
69
+ ```
70
+
71
+ Dry-run the dataset loader first if you want to verify counts and filtering:
72
+
73
+ ```sh
74
+ uv run python -m train.sft_student --dry-run
75
+ ```
76
+
77
+ The loader validates the assistant action JSON and drops malformed older rows automatically, so the early pre-cleanup SFT dump does not need manual editing.
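
As a rough illustration, the kind of filtering the loader performs looks like this (the function name, row fields, and `"tool"` key are assumptions for this sketch, not the actual `train/sft_student.py` code; only the five bounded tool names come from the environment):

```python
import json

# Hypothetical sketch: keep an SFT row only if its assistant turn parses as a
# JSON action naming one of the bounded tools. Row/field names are assumptions.
VALID_TOOLS = {"read_file", "write_file", "run_static_eval", "run_match", "finish"}

def is_valid_sample(raw_line: str) -> bool:
    try:
        row = json.loads(raw_line)
        action = json.loads(row["assistant"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return False
    return isinstance(action, dict) and action.get("tool") in VALID_TOOLS

good = '{"assistant": "{\\"tool\\": \\"run_match\\"}"}'
bad = '{"assistant": "let me think about the position first"}'
```

A real loader would also validate argument fields, but this captures the drop-malformed-rows behavior described above.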
78
+
79
+ ## Benchmarking Engine Strength
80
+
81
+ Compare two eval files on held-out Chess960 start positions:
82
+
83
+ ```sh
84
+ uv run python -m train.benchmark_eval \
85
+ --candidate-file src/zero960/workspace_template/eval.py \
86
+ --baseline-file src/zero960/engine/default_eval.py \
87
+ --positions 64 \
88
+ --depth 2
89
+ ```
90
+
91
+ This is the metric that matters for "better chess" in this repo. Training reward can teach the workflow, but real strength should be checked with a held-out match score.
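
To translate a held-out match score into an Elo estimate, the standard logistic conversion applies (this is generic Elo math, not a helper shipped in this repo):

```python
import math

def elo_delta_from_score(score: float) -> float:
    """Invert the Elo expected-score formula: 0.5 -> 0, ~0.64 -> ~+100 Elo.

    Only meaningful for scores strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)
```

Scores from a few dozen games are noisy, so treat the resulting Elo delta as a direction indicator rather than a rating.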
92
+
93
+ Benchmark a local eval file against an external UCI engine such as Stockfish:
94
+
95
+ ```sh
96
+ uv run python -m train.benchmark_uci \
97
+ --candidate-file src/zero960/workspace_template/eval.py \
98
+ --engine-command stockfish \
99
+ --engine-option UCI_LimitStrength=true \
100
+ --engine-option UCI_Elo=1320 \
101
+ --positions 32 \
102
+ --candidate-depth 2 \
103
+ --engine-depth 1
104
+ ```
105
+
106
+ This is the cleanest anchor for demo purposes: keep the repo baseline as `0 Elo`, then report how the current champion scores against fixed Stockfish settings under the same Chess960 benchmark.
107
+
108
+ Benchmark two full engine roots so each side uses its own `search.py` plus its own eval file:
109
+
110
+ ```sh
111
+ uv run python -m train.benchmark_engine \
112
+ --candidate-root /tmp/0x960-codex-swarm/worker-1 \
113
+ --baseline-root /Users/qtzx/Desktop/codebase/0x960 \
114
+ --positions 32 \
115
+ --depth 2
116
+ ```
117
+
118
+ Use this when you want to open search heuristics safely. The older eval-only benchmark is still the right promotion gate while workers only edit `eval.py`, but once search changes are allowed, head-to-head must load each side's own `search.py` instead of sharing the live repo implementation.
119
+
120
+ To benchmark a candidate against the original baseline plus accepted swarm champions:
121
+
122
+ ```sh
123
+ uv run python -m train.benchmark_league \
124
+ --candidate-file outputs/codex_swarm/champion_eval.py \
125
+ --positions 16
126
+ ```
127
+
128
+ By default this league includes the original baseline and the most recent accepted swarm snapshots, while skipping any snapshot that is byte-identical to the candidate. This is the simplest self-play-style check for “did the engine improve against its own history, not just one baseline?”
129
+
130
+ To generate a static dashboard with swarm progression, league results, and optional Stockfish anchors:
131
+
132
+ ```sh
133
+ uv run python -m train.build_dashboard --include-stockfish
134
+ ```
135
+
136
+ This writes [index.html](outputs/dashboard/index.html) plus the backing [dashboard_data.json](outputs/dashboard/dashboard_data.json). Open the HTML file locally to inspect accepted champions, internal Elo deltas, league self-play, and anchor bars in one place.
137
+
138
+ To generate submission-ready PNGs for media uploads (score progression + anchor bars), run:
139
+
140
+ ```sh
141
+ python3 scripts/generate_submission_media.py
142
+ ```
143
+
144
+ This writes tracked files under `media/submission/`.
145
+
146
+ To also surface the current search gain against the saved pre-upgrade engine baseline:
147
+
148
+ ```sh
149
+ uv run python -m train.build_dashboard \
150
+ --include-engine-progress \
151
+ --engine-baseline-root /tmp/0x960-search-baseline \
152
+ --include-stockfish
153
+ ```
154
+
155
+ ## Local Codex Swarm
156
+
157
+ Initialize the local champion plus worker sandboxes:
158
+
159
+ ```sh
160
+ uv run python -m train.codex_swarm setup --workers 3
161
+ ```
162
+
163
+ Run one champion/challenger round with Codex workers:
164
+
165
+ ```sh
166
+ uv run python -m train.codex_swarm run \
167
+ --workers 5 \
168
+ --rounds 1 \
169
+ --model gpt-5.3-codex \
170
+ --screen-positions 8 \
171
+ --positions 16 \
172
+ --worker-timeout-sec 180 \
173
+ --max-diff-lines 80
174
+ ```
175
+
176
+ Run a search-focused round that edits only `src/zero960/engine/search.py` and benchmarks full engine roots:
177
+
178
+ ```sh
179
+ uv run python -m train.codex_swarm run \
180
+ --workers 3 \
181
+ --rounds 1 \
182
+ --surface search \
183
+ --model gpt-5.3-codex \
184
+ --screen-positions 4 \
185
+ --positions 8 \
186
+ --worker-timeout-sec 180 \
187
+ --max-diff-lines 100
188
+ ```
189
+
190
+ Dry-run the coordinator without invoking Codex:
191
+
192
+ ```sh
193
+ uv run python -m train.codex_swarm run --workers 3 --rounds 1 --dry-run --serial
194
+ ```
195
+
196
+ Run the swarm in a continuous champion/challenger loop:
197
+
198
+ ```sh
199
+ uv run python -m train.codex_swarm run \
200
+ --workers 5 \
201
+ --continuous \
202
+ --max-stall-rounds 3 \
203
+ --model gpt-5.3-codex \
204
+ --screen-positions 8 \
205
+ --positions 16 \
206
+ --max-diff-lines 80 \
207
+ --worker-timeout-sec 180
208
+ ```
209
+
210
+ The coordinator now rejects overgrown whole-file rewrites by default. Workers are expected to make surgical `eval.py` edits that stay within the `--max-diff-lines` budget; increasing that flag should be a deliberate choice, not the default. Codex workers no longer run the held-out match benchmark themselves. They patch, optionally do one tiny local sanity check, and stop; the coordinator runs an `8`-position screen on every eligible patch and only runs the heavier final benchmark on the best screen winner.
211
+
212
+ For `--surface search`, the coordinator freezes a baseline engine snapshot for the round and uses [benchmark_engine.py](train/benchmark_engine.py) so each side gets its own `search.py` plus its own eval. That is the safe path once workers are allowed to touch search heuristics.
213
+
214
+ The coordinator tries real git worktrees first and falls back to lightweight local clones under `/tmp/0x960-codex-swarm/` when worktree metadata is not writable. Swarm state and accepted challengers are recorded under `outputs/codex_swarm/`. The fast default is now a 3-worker wave, and the coordinator reorders hook lanes each round so empty hooks are targeted first, then simple passthrough hooks, and already-customized hooks last.
215
+
216
+ Each worker now gets a small local research pack before it edits:
217
+
218
+ - `AGENTS.md`, `README.md`, and [Codex Swarm Plan](docs/codex-swarm-plan.md)
219
+ - benchmark scripts in `train/`
220
+ - the current champion snapshot
221
+ - the swarm ledger
222
+ - accepted historical winners under `outputs/codex_swarm/accepted/`
223
+
224
+ The default roles are:
225
+
226
+ - `worker-1`: Structure Researcher
227
+ - `worker-2`: Tactical Safety Researcher
228
+ - `worker-3`: Activity Researcher
229
+ - `worker-4`: Pawn-Endgame Researcher
230
+ - `worker-5`: Initiative Tuner
231
+
232
+ Workers still edit only one file per round. On the default `eval` surface they patch `src/zero960/workspace_template/eval.py`; on the `search` surface they patch `src/zero960/engine/search.py`. In both modes they can inspect the full local research pack to avoid repeating prior winners and to justify their patch against actual benchmark history.
233
+
234
+ To copy the current swarm champion back into the source tree:
235
+
236
+ ```sh
237
+ uv run python -m train.codex_swarm promote
238
+ ```
239
+
240
+ ## Notes
241
+
242
+ - The environment already includes the current `eval.py` contents in each observation.
243
+ - Reward shaping now favors valid edits, explicit `run_match`, and clean `finish`.
244
+ - Invalid writes are rolled back immediately so bad code does not poison the rest of the episode.
245
+
246
+ ## Docs
247
+
248
+ - [Architecture](docs/architecture.md)
249
+ - [Codex Swarm Plan](docs/codex-swarm-plan.md)
250
+ - [Why Chess960](docs/why_chess960.md)
251
+ - [Demo Script](docs/demo-script.md)
252
+ - [Process Log](docs/process.md)
docs/agent-log-instruction.md DELETED
@@ -1,30 +0,0 @@
1
- # agent log instruction
2
-
3
- Use this as a reusable instruction snippet for coding agents working on 0x960.
4
-
5
- ## short snippet
6
-
7
- After each meaningful implementation step, append a short entry to `docs/process.md`.
8
-
9
- Each entry should:
10
-
11
- - use the current timestamp
12
- - summarize what changed in 2-5 factual bullets
13
- - note any important decisions or blockers
14
- - end with a clear next step
15
-
16
- Do not paste large raw command outputs into the log. Summarize them instead.
17
-
18
- ## longer snippet
19
-
20
- You are working in the 0x960 repo. Maintain `docs/process.md` as the project build log.
21
-
22
- Rules:
23
-
24
- - append to the file after each meaningful work block, not after every micro-step
25
- - keep entries concise and factual
26
- - include what changed, why it changed, blockers, and the next step
27
- - prefer evidence summaries over raw terminal dumps
28
- - optimize the log for demo storytelling and judge review
29
-
30
- If you make a product or architecture decision, record it. If a test fails, record the failure briefly and say what remains to fix.
 
docs/architecture.md CHANGED
@@ -1,188 +1,71 @@
1
- # architecture
2
 
3
- ## stack decisions
4
 
5
- These are fixed by the hackathon or by scope discipline:
6
 
7
- - environment interface: OpenEnv `0.2.1`
8
- - deployment target: HF Spaces
9
- - training demo: HF TRL or Unsloth in Colab
10
- - core model class: open-weight OSS model only
11
- - optional infra for real training: Northflank H100
 
 
 
12
 
13
- Closed frontier models are not part of the core training path. If we use them at all, they are comparison-only in the demo layer.
14
 
15
- ## system shape
16
 
17
- The system should have four layers.
 
 
 
 
18
 
19
- ### 1. engine workspace
20
 
21
- A minimal Chess960 engine scaffold with:
22
 
23
- - fixed move generation
24
- - fixed search implementation
25
- - fixed tournament runner
26
- - one narrow editable surface
27
-
28
- Recommended editable surface:
29
-
30
- - `engine/eval.py`
31
-
32
- Optional later extension:
33
-
34
- - `engine/weights.json`
35
-
36
- The whole repo should not be editable by the policy. Narrow edit scope keeps training stable and makes the story legible.
37
-
38
- ### 2. environment runtime
39
-
40
- The environment owns the full episode lifecycle:
41
-
42
- 1. clone a fresh engine workspace
43
- 2. sample one Chess960 start or a small suite of starts
44
- 3. expose bounded actions to the model
45
- 4. execute actions and return observations
46
- 5. after the step budget is exhausted, run matches
47
- 6. compute reward and terminate
48
-
49
- The environment should be written as a normal Python runtime first, then wrapped cleanly for OpenEnv.
50
-
51
- ### 3. reward and evaluation harness
52
-
53
- This layer runs fast matches between the edited engine and baselines.
54
-
55
- It should provide:
56
-
57
- - training reward matches
58
- - held-out evaluation matches
59
- - crash handling
60
- - reproducible position sampling
61
-
62
- ### 4. training loop
63
-
64
- The training loop should use:
65
-
66
- - GRPO or equivalent in TRL/Unsloth
67
- - a rollout function that runs a full episode
68
- - checkpoint logging
69
- - reward curves and crash-rate metrics
70
-
71
- The training loop is minimal by design. The goal is to show the environment can produce a learnable signal, not to max out Elo during the hackathon.
72
-
73
- ## episode contract
74
-
75
- ### observation
76
-
77
- Each step should return a compact structured observation containing:
78
 
79
  - the task instruction
80
- - current editable file contents
81
  - recent action history
82
- - recent command outputs or error messages
83
- - remaining step budget
84
- - start-position metadata
85
-
86
- ### actions
87
-
88
- Start with structured actions, not open shell access.
89
-
90
- - `read_file(path)`
91
- - `write_file(path, content)`
92
- - `run_static_eval()`
93
- - `run_match()`
94
- - `finish()`
95
-
96
- If needed, a restricted shell tool can be added later, but it should not be required for MVP.
97
-
98
- ### termination
99
-
100
- An episode ends when:
101
-
102
- - the agent calls `finish()`
103
- - the step budget is exhausted
104
- - the workspace becomes invalid in a fatal way
105
-
106
- ## reward design
107
-
108
- Default reward for MVP:
109
-
110
- - primary reward: match score against a fixed baseline engine
111
- - penalty: invalid edit, crash, or timeout
112
-
113
- Recommended first-pass formula:
114
-
115
- `reward = score_vs_fixed_baseline - crash_penalty`
116
-
117
- Do not make parent-checkpoint self-play the only reward. If we use it, it should be a secondary signal only.
118
-
119
- ## evaluation protocol
120
-
121
- Training and evaluation must be separated.
122
-
123
- ### training
124
-
125
- - sample Chess960 starts from a training pool
126
- - play a small number of fast games against the fixed baseline
127
- - compute reward
128
-
129
- ### held-out eval
130
-
131
- - separate fixed start-position suite
132
- - fixed baseline configuration
133
- - fixed game count and time control
134
- - run periodically on saved checkpoints
135
-
136
- This is how we avoid fooling ourselves with a rising training reward that does not correspond to stronger engines.
137
-
138
- ## model strategy
139
-
140
- We should optimize for stable tool behavior, not for the largest model possible.
141
-
142
- Recommended order:
143
-
144
- 1. `Qwen3.5-9B`
145
- 2. one backup model with good coding/tool-use behavior
146
- 3. only try larger models if the smaller path is already stable
147
-
148
- Single-H100-safe priority:
149
-
150
- - dense 7B to 14B class models first
151
- - larger MoE models only if integration is already working
152
-
153
- ## speed target
154
-
155
- A good MVP episode should be cheap enough to run many times.
156
-
157
- Target envelope:
158
 
159
- - step budget: `4-8` actions
160
- - match count: very small during training
161
- - episode runtime: ideally under `30s`
162
 
163
- If episodes are too slow, we reduce game count before we add complexity elsewhere.
164
 
165
- ## deployment
166
 
167
- ### HF Spaces
168
 
169
- HF Spaces hosts the OpenEnv environment and provides the submission artifact judges can inspect.
 
 
 
170
 
171
- ### Colab
172
 
173
- Colab provides the minimal public training notebook using TRL or Unsloth.
174
 
175
- ### Northflank
176
 
177
- Northflank is the practical training box if we want a real H100-backed run, but it is not required for the minimal architecture itself.
 
 
 
 
 
178
 
179
- ## deferred work
180
 
181
- These are explicitly outside the MVP:
182
 
183
- - frontier model integrations
184
- - OAuth-based coding agent sessions
185
- - multi-agent swarm variants
186
- - Elo dashboards
187
- - tournament leagues across many checkpoints
188
- - full ACP-like unrestricted workspace tooling
 
1
+ # Architecture
2
 
3
+ ## Core Shape
4
 
5
+ 0x960 has four moving parts:
6
 
7
+ 1. `src/zero960/engine/`
8
+ A minimal Chess960 engine with fixed search and one narrow editable surface: `eval.py`.
9
+ 2. `src/zero960/runtime/`
10
+ The episode runtime that owns workspace resets, bounded actions, reward shaping, and match scoring.
11
+ 3. `src/zero960_env/`
12
+ The OpenEnv wrapper and WebSocket client/server layer.
13
+ 4. `train/`
14
+ Distillation and RL entrypoints that operate on the same bounded action schema.
15
 
16
+ ## Action Space
17
 
18
+ The policy only gets structured actions:
19
 
20
+ - `read_file`
21
+ - `write_file`
22
+ - `run_static_eval`
23
+ - `run_match`
24
+ - `finish`
25
 
26
+ The full repo is not editable. The policy can only modify `eval.py` inside a fresh workspace.
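
As a concrete illustration, a single policy turn could carry payloads shaped like the following (the field names are assumptions for this sketch, not the actual models defined in `src/zero960_env/`):

```python
# Illustrative bounded-action payloads; exact field names are assumptions.
write_action = {
    "tool": "write_file",
    "path": "eval.py",                      # the only editable surface
    "content": "def evaluate(board):\n    return material_balance(board)\n",
}
match_action = {"tool": "run_match"}        # no arguments: baseline and positions are fixed
finish_action = {"tool": "finish"}
```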
27
 
28
+ ## Observation Shape
29
 
30
+ Each observation includes:
 
31
 
32
  - the task instruction
33
+ - the current `eval.py` contents
34
  - recent action history
35
+ - remaining steps
36
+ - last match score
37
+ - workflow hints and suggested next actions
 
38
 
39
+ The current file contents are already visible in the observation, so the intended high-reward loop is:
 
 
40
 
41
+ `write_file -> run_match -> finish`
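
Put together, one observation might look roughly like this (key names are invented for the sketch; the real schema lives in the env models):

```python
# Illustrative observation payload; key names are assumptions, not the real schema.
observation = {
    "instruction": "Improve eval.py so the engine beats the fixed baseline.",
    "eval_py": "def evaluate(board): ...",    # current editable file contents
    "history": ["write_file", "run_match"],   # recent actions
    "steps_remaining": 3,
    "last_match_score": 0.625,
    "hint": "score improved after the edit; finish is now reasonable",
}
```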
42
 
43
+ ## Reward Design
44
 
45
+ Reward is match-score-based with explicit shaping around the edit loop:
46
 
47
+ - positive signal for valid changed writes
48
+ - positive signal for explicit `run_match` after a write
49
+ - penalties for repeated `run_static_eval`, redundant `read_file`, and finishing without a meaningful edit/test cycle
50
+ - invalid writes are rolled back immediately
51
 
52
+ This keeps the environment learnable while still grounding the main score in downstream engine strength.
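
A minimal sketch of that shaping, with invented coefficients (the real weights live in the runtime and will differ):

```python
def shaped_reward(match_score: float, valid_write: bool,
                  ran_match_after_write: bool, redundant_calls: int,
                  invalid_write: bool) -> float:
    """Illustrative shaping; all coefficients are made up for the sketch."""
    reward = match_score                    # primary: held-out match score vs baseline
    if valid_write:
        reward += 0.05                      # bonus for a valid changed write
    if ran_match_after_write:
        reward += 0.05                      # bonus for explicitly testing the edit
    reward -= 0.02 * redundant_calls        # penalize read/static-eval spam
    if invalid_write:
        reward -= 0.5                       # the bad write is also rolled back
    return reward
```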
53
 
54
+ ## Training Strategy
55
 
56
+ Current order of operations:
57
 
58
+ 1. teacher distillation
59
+ Use a strong coding model such as Codex/GPT-5.4 to generate successful bounded-action trajectories.
60
+ 2. student fine-tuning
61
+ Fine-tune a smaller open model on those trajectories.
62
+ 3. RL refinement
63
+ Use GRPO or a similar method only after the student already knows the workflow.
64
 
65
+ This is the main shift from the earlier RL-first plan. The hard part has been action discovery, not just optimization.
66
 
67
+ ## Deployment
68
 
69
+ - HF Spaces: public OpenEnv artifact
70
+ - Northflank H100: practical heavy training and debugging box
71
+ - local dev: fastest loop for environment and prompt iteration
 
 
 
docs/codex-swarm-plan.md ADDED
@@ -0,0 +1,168 @@
1
+ # Codex Swarm Plan
2
+
3
+ This is the current highest-leverage path for building a stronger Chess960 eval engine quickly.
4
+
5
+ ## Goal
6
+
7
+ Use multiple Codex workers as an outer-loop engine lab:
8
+
9
+ - propose changes to `eval.py` and small search heuristics
10
+ - benchmark every candidate against the current champion
11
+ - keep only Elo-positive patches
12
+ - periodically distill the best traces back into a smaller open student
13
+
14
+ The point is not to replace the OpenEnv environment. The point is to use strong coding agents to search the engine-design space faster than raw RL can.
15
+
16
+ ## Why This Path
17
+
18
+ The project has already shown two things:
19
+
20
+ - base small models do not reliably discover the edit loop on their own
21
+ - a distilled student can learn `write_file -> run_match -> finish`
22
+
23
+ That solves workflow compliance, but the submission's central claim has to be engine strength. The next bottleneck is finding better eval/search ideas, not teaching the loop again from scratch.
24
+
25
+ ## Worker Architecture
26
+
27
+ Run one coordinator plus several parallel Codex workers locally by default. Use the H100 only for heavy benchmark or training jobs.
28
+
29
+ - Coordinator:
30
+ - assigns experiment ideas
31
+ - tracks the current champion engine
32
+ - merges only benchmark-positive patches
33
+ - Worker:
34
+ - runs in its own git worktree when possible, with a lightweight local clone fallback when the environment cannot write `.git/worktrees`
35
+ - researches the current champion, accepted history, and benchmark code before editing
36
+ - edits a narrow surface area
37
+ - returns one bounded patch plus a short rationale
38
+ - lets the coordinator run the held-out benchmark and promotion gate
39
+
40
+ The default fast wave should use 3 workers. The coordinator should re-rank hook lanes each round so empty hooks are targeted first, then simple passthrough hooks, and already-customized hooks last. Workers remain specialist researcher-implementers with read-only access to:
41
+
42
+ - `AGENTS.md`, `README.md`, and this plan
43
+ - `train/benchmark_eval.py`, `train/benchmark_league.py`, and `train/benchmark_uci.py`
44
+ - the current champion at `outputs/codex_swarm/champion_eval.py`
45
+ - the promotion ledger at `outputs/codex_swarm/ledger.jsonl`
46
+ - all accepted snapshots under `outputs/codex_swarm/accepted/`
47
+
48
+ The available specialist roles are:
49
+
50
+ - worker 1: Structure Researcher
51
+ king safety and castling structure in Chess960 starts
52
+ - worker 2: Tactical Safety Researcher
53
+ loose-piece pressure, attacked-undefended pieces, and practical safety terms
54
+ - worker 3: Activity Researcher
55
+ piece activity, development, space, and centralization at shallow search depth
56
+ - worker 4: Pawn-Endgame Researcher
57
+ pawn structure, passed pawns, rook files, and simple endgame conversion terms
58
+ - worker 5: Initiative Tuner
59
+ tempo, mobility pressure, queen safety, and initiative terms that convert shallow-search advantages faster
60
+
61
+ After each promotion, the coordinator should automatically deprioritize the lane that just gained custom logic and spend the next short wave on the emptier hooks.
62
+
63
+ There are now two practical swarm surfaces:
64
+
65
+ - `eval` surface:
66
+ workers edit only `src/zero960/workspace_template/eval.py` and benchmark with `train/benchmark_eval.py`
67
+ - `search` surface:
68
+ workers edit only `src/zero960/engine/search.py` and benchmark with `train/benchmark_engine.py` so each side gets its own eval plus its own searcher
69
+
70
+ ## Evaluation Loop
71
+
72
+ Every candidate should go through the same loop:
73
+
74
+ 1. read current engine code and latest benchmark results
75
+ 2. make one bounded patch
76
+ 3. stop quickly and hand the patch back to the coordinator
77
+ 4. let the coordinator run a cheap screen benchmark first
78
+ 5. run a heavier final benchmark only on the best screen winner
79
+ 6. keep only patches that improve held-out score or estimated Elo delta
80
+
81
+ Preferred rule:
82
+
83
+ - no patch is promoted unless it beats the current champion on a fixed held-out benchmark set
84
+ - each worker should make one bounded patch and stop; the coordinator owns held-out benchmarking
85
+ - benchmark in stages: cheap screen on all eligible candidates, heavier final check only for the best screen winner
86
+ - workers that run too long should be timed out rather than left to wander
87
+ - workers should inspect accepted history first so lanes diverge instead of repeating the same patch four times
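
The staged rule above can be sketched as a tiny coordinator routine (names and signatures are invented here; the real gate lives in the `train.codex_swarm` coordinator):

```python
# Illustrative two-stage gate: cheap screen for every candidate, heavy held-out
# benchmark only for the screen winner. Benchmark callables are stand-ins.
def promote_best(candidates, screen_bench, final_bench, champion_score=0.5):
    if not candidates:
        return None
    scored = [(screen_bench(c), c) for c in candidates]
    _, best = max(scored, key=lambda pair: pair[0])
    # Promotion keys off the held-out final benchmark, never the cheap screen.
    if final_bench(best) > champion_score:
        return best
    return None
```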
88
+
89
+ ## Safety Constraints
90
+
91
+ Keep the search legible and hard to game.
92
+
93
+ - edit only engine files, benchmark code, or clearly scoped support code
94
+ - no dependency churn unless explicitly needed
95
+ - no broad repo rewrites
96
+ - benchmark on held-out Chess960 starts, not only the training positions
97
+ - record candidate, benchmark settings, and result for every accepted patch
98
+
99
+ ## Local Setup
100
+
101
+ The default shape is:
102
+
103
+ - install Codex CLI locally
104
+ - log in locally
105
+ - create multiple git worktrees next to the main repo
106
+ - run several Codex workers in parallel from those worktrees
107
+ - keep the main repo as the coordinator / champion branch
108
+
109
+ Useful local pattern:
110
+
111
+ - main repo: champion branch and benchmark history
112
+ - `/tmp/0x960-codex-swarm/worker-1`
113
+ - `/tmp/0x960-codex-swarm/worker-2`
114
+ - `/tmp/0x960-codex-swarm/worker-3`
115
+
116
+ If device auth is used, `codex login --device-auth` will print a one-time URL and code. If API-key auth is easier, `codex login --with-api-key` is also fine.
117
+
118
+ ## Optional H100 Use
119
+
120
+ The H100 is still useful, but not as the primary Codex host.
121
+
122
+ - run large held-out benchmarks there
123
+ - run student SFT there
124
+ - run RL refinement there if needed later
125
+
126
+ This keeps Codex orchestration simple while still using the GPU box where it actually matters.
127
+
128
+ ## Relationship To OpenEnv
129
+
130
+ OpenEnv is still the core environment and submission artifact.
131
+
132
+ This swarm loop is an outer optimization layer:
133
+
134
+ - OpenEnv remains the bounded agent environment
135
+ - teacher traces can still be collected from Codex in the bounded action schema
136
+ - the best engine patches found by the swarm can become:
137
+ - stronger workspace templates
138
+ - better baselines
139
+ - better teacher data
140
+ - better student targets
141
+
142
+ ## Immediate Next Steps
143
+
144
+ 1. Finish local Codex CLI auth.
145
+ 2. Run `uv run python -m train.codex_swarm setup --workers 3`.
146
+ 3. Start with `uv run python -m train.codex_swarm run --workers 3 --rounds 1 --model gpt-5.3-codex --screen-positions 8 --positions 16 --worker-timeout-sec 180 --max-diff-lines 80`.
147
+ 4. Start on the `eval` surface only until the hook lanes are no longer giving clean wins, then open the `search` surface.
148
+ 5. Promote only patches that improve held-out benchmark score.
149
+ 6. Use the H100 only for heavier benchmark or training passes.
150
+ 7. Distill the accepted traces back into the student model after enough wins accumulate.
151
+
152
+ For a longer autonomous loop, use:
153
+
154
+ ```sh
155
+ uv run python -m train.codex_swarm run \
156
+ --workers 5 \
157
+ --continuous \
158
+ --max-stall-rounds 3 \
159
+ --model gpt-5.3-codex \
160
+ --screen-positions 8 \
161
+ --positions 16 \
162
+ --max-diff-lines 80 \
163
+ --worker-timeout-sec 180
164
+ ```
165
+
166
+ The coordinator should stay opinionated about patch size. Recent Codex waves tended to rewrite nearly the whole file even when the actual improvement was a one-line or one-function tweak, so the current default rejects candidates that exceed the `--max-diff-lines` budget.
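As a rough illustration of how such a budget can be computed, the added/deleted line counts can come from stdlib `difflib`; the function name and default below are placeholders, not the coordinator's actual code:

```python
import difflib

def diff_line_budget_ok(champion_src: str, candidate_src: str,
                        max_diff_lines: int = 80) -> bool:
    """Hypothetical sketch of a --max-diff-lines gate: count added and
    deleted lines between the frozen champion snapshot and a candidate,
    and reject any candidate whose total churn exceeds the budget."""
    added = deleted = 0
    for line in difflib.unified_diff(
        champion_src.splitlines(), candidate_src.splitlines(), lineterm=""
    ):
        if line.startswith("+") and not line.startswith("+++"):
            added += 1
        elif line.startswith("-") and not line.startswith("---"):
            deleted += 1
    return added + deleted <= max_diff_lines
```

A one-line tweak passes this gate; a whole-file rewrite fails it regardless of its benchmark score.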
167
+
168
+ This keeps running until interrupted or until several consecutive rounds fail to promote a new champion.
docs/concept.md DELETED
@@ -1,72 +0,0 @@
1
- # 0x960
2
-
3
- ## what this is
4
-
5
- 0x960 is an OpenEnv environment where a model improves a minimal Chess960 engine by making bounded code edits to its evaluation logic.
6
-
7
- The agent does not play chess directly. It reads engine files, edits the eval function, runs checks, and is rewarded by match performance against a fixed baseline.
8
-
9
- ## hackathon fit
10
-
11
- This project is designed to satisfy the OpenEnv hackathon constraints:
12
-
13
- - use OpenEnv `0.2.1`
14
- - deploy the environment on HF Spaces
15
- - provide a minimal training script in Colab using HF TRL or Unsloth
16
-
17
- ## core claim
18
-
19
- The interesting task is not "can an LLM output a good chess move?"
20
-
21
- The interesting task is:
22
-
23
- - can a model operate inside a real coding environment
24
- - make multi-step edits to a live system
25
- - and improve that system under an objective downstream metric
26
-
27
- Chess960 is useful because opening memorization is much less valuable than in standard chess. That makes engine improvement a better fit for an agentic environment than pure next-move prediction.
28
-
29
- ## novelty claim
30
-
31
- We should not claim that tool use is new or that Chess960 benchmarking is new.
32
-
33
- The stronger and more defensible claim is:
34
-
35
- - Chess960 engine evaluation is an existing benchmark domain
36
- - coding agents with tool use are an existing capability pattern
37
- - 0x960 combines them into a self-improvement RL environment where the model modifies engine code and is rewarded by actual engine strength
38
-
39
- ## why this is a good OpenEnv task
40
-
41
- - it is multi-step, not single-step classification
42
- - reward comes from a real external process: engine matches
43
- - the agent interacts with files and commands, not just text
44
- - failure modes are meaningful: bad edits, crashes, invalid code, weak evals
45
-
46
- This aligns best with:
47
-
48
- - Statement 3.1: Professional Tasks
49
- - Statement 4: Self-Improvement
50
-
51
- ## MVP
52
-
53
- The MVP should be intentionally narrow.
54
-
55
- - one minimal Chess960 engine scaffold
56
- - one fixed search implementation
57
- - one narrow editable surface: `eval.py` or `weights.json`
58
- - one fixed baseline opponent
59
- - one held-out evaluation suite
60
- - one training path using OpenEnv + TRL/Unsloth
61
-
62
- ## non-goals for MVP
63
-
64
- - no frontier model dependency in the training loop
65
- - no OAuth or hosted coding-agent integration
66
- - no multi-agent swarm
67
- - no broad repo-wide code editing
68
- - no polished Elo dashboard unless the core loop already works
69
-
70
- ## practical pitch
71
-
72
- "We built an OpenEnv environment where a model learns to be a Chess960 engine engineer, not a chess player. The model uses bounded coding actions to improve an engine's eval function, and reward comes from whether the edited engine actually performs better."
 
docs/demo-script.md CHANGED
@@ -1,31 +1,31 @@
1
- # one-minute demo script
2
 
3
- ## 30-second version
4
 
5
- We built 0x960, an OpenEnv environment where a model learns to be a Chess960 engine engineer, not a chess player. Chess960 is useful because it removes much of the opening memorization that standard chess systems can rely on, making it a stronger test of generalization. In our environment, the model gets a bounded coding workspace, edits the engine's eval function, runs checks, and is rewarded by whether the edited engine actually performs better against a fixed baseline. The training signal comes from real downstream engine strength, not just text imitation or next-move prediction.
6
 
7
- ## full one-minute outline
8
 
9
- ### 1. opening
10
 
11
- 0x960 is an OpenEnv environment for training models to improve a Chess960 engine through bounded code edits.
12
 
13
- ### 2. why this task
14
 
15
- Chess960 keeps the rules of chess the same but randomizes the starting position, so it is a cleaner test of robustness than standard chess alone.
16
 
17
- ### 3. what the model does
18
 
19
- The model does not play chess directly. It reads engine files, edits `eval.py`, runs checks, and decides when to finish.
20
 
21
- ### 4. reward
22
 
23
- After the edit budget is used, the engine plays fast matches against a fixed baseline. Reward is based on match score, with penalties for invalid edits or crashes.
24
 
25
- ### 5. why OpenEnv
26
 
27
- This is a real multi-step tool-use task with files, commands, failures, and downstream evaluation. That makes it a strong fit for Statement 3.1 and Statement 4.
28
 
29
- ### 6. close
30
 
31
- The result is a self-improvement environment where the model learns to engineer a stronger Chess960 system, not just imitate chess moves.
 
1
+ # Demo Script
2
 
3
+ ## 30-Second Version
4
 
5
+ 0x960 is an OpenEnv environment where a model learns to act like a Chess960 engine engineer, not a chess player. The model gets a bounded coding workspace, edits `eval.py`, tests the change with fast matches, and is rewarded by whether the engine actually improves. We found that raw RL alone struggled because base models did not discover the edit loop, so the current path is teacher distillation first and RL refinement second.
6
 
7
+ ## One-Minute Outline
8
 
9
+ ### 1. Opening
10
 
11
+ 0x960 is a bounded self-improvement environment for a minimal Chess960 engine.
12
 
13
+ ### 2. Why Chess960
14
 
15
+ Chess960 keeps the rules of chess fixed while changing the starting position, so it is a cleaner robustness test than standard chess alone.
16
 
17
+ ### 3. What the Agent Does
18
 
19
+ The policy sees the current `eval.py`, writes a bounded replacement, runs a match, and decides when to finish.
20
 
21
+ ### 4. Why Teacher Distillation
22
 
23
+ Base models were not discovering `write_file` reliably, so we added a teacher path: collect successful bounded-action trajectories from a stronger coding agent, fine-tune a smaller open model on those traces, then use RL to refine it.
24
 
25
+ ### 5. Why OpenEnv
26
 
27
+ This is a real multi-step tool-use task with code edits, failures, and downstream evaluation. The reward comes from engine strength, not proxy text metrics.
28
 
29
+ ### 6. Close
30
 
31
+ The result is a self-improvement environment where the model learns a real engineering workflow instead of just outputting moves or text.
docs/open-questions.md DELETED
@@ -1,85 +0,0 @@
1
- # open questions
2
-
3
- ## blockers to resolve first
4
-
5
- 1. **engine skeleton**
6
-
7
- What is the smallest Chess960 engine we can ship quickly while still making eval edits meaningful?
8
-
9
- Default assumption:
10
-
11
- - python move generation
12
- - fixed search
13
- - pluggable eval module
14
-
15
- 2. **OpenEnv integration**
16
-
17
- What is the thinnest wrapper needed to expose the environment through OpenEnv `0.2.1` and still support a multi-step episode?
18
-
19
- We should prefer the simplest compliant implementation over clever abstractions.
20
-
21
- 3. **training loop shape**
22
-
23
- What is the smallest public Colab example that proves the reward loop works with TRL or Unsloth?
24
-
25
- The goal is not large-scale training in Colab. The goal is to show a valid training script and some observable reward signal.
26
-
27
- 4. **baseline and held-out suite**
28
-
29
- We need one fixed training baseline and one fixed held-out evaluation suite.
30
-
31
- If the baseline is too weak, reward saturates. If it is too strong, the policy gets no signal.
32
-
33
- 5. **episode speed**
34
-
35
- How many games can we afford per episode while keeping iteration tight enough to show learning during the hackathon?
36
-
37
- ## defaults unless they fail
38
-
39
- These are no longer open-ended research questions. They are the default implementation choices until proven insufficient.
40
-
41
- 1. **model**
42
-
43
- Start with a small open model in the `7B-14B` range, with `Qwen3.5-9B` as the default first candidate.
44
-
45
- 2. **action space**
46
-
47
- Use structured actions, not unrestricted shell access.
48
-
49
- 3. **editable surface**
50
-
51
- Restrict writes to `eval.py` and optionally `weights.json`.
52
-
53
- 4. **reward**
54
-
55
- Use fixed-baseline match score with a crash penalty.
56
-
57
- 5. **comparison models**
58
-
59
- Do not use frontier closed models in the core training loop.
60
-
61
- ## possible upgrades if time remains
62
-
63
- 1. **parent-checkpoint reward**
64
-
65
- Add score against the previous checkpoint or a small checkpoint pool as an auxiliary curriculum signal, not as the only reward.
66
-
67
- 2. **frontier comparison**
68
-
69
- Run a closed frontier coding agent in the same environment for demo purposes only.
70
-
71
- 3. **visualization**
72
-
73
- Plot reward curves, checkpoint strength, and action traces.
74
-
75
- 4. **league evaluation**
76
-
77
- Run small tournaments among checkpoints to show progression over time.
78
-
79
- ## explicitly deferred
80
-
81
- - multi-agent or swarm architectures
82
- - OAuth integration
83
- - unrestricted ACP-style terminal access
84
- - large-model training beyond what a single H100 can support comfortably
85
- - polished benchmark packaging beyond the hackathon submission
 
docs/process.md CHANGED
@@ -15,6 +15,32 @@ Logging rules:
15
  - include decisions, blockers, and concrete next steps
16
  - summarize command/test results instead of pasting long raw output
17
 
18
  ## 2026-03-07 17:10 PST
19
 
20
  - Read the initial project docs and collapsed the scope from a broad research wishlist into a narrow hackathon MVP.
@@ -103,3 +129,358 @@ Logging rules:
103
  - Also hit vLLM 0.17 vs transformers 5.3 incompatibility (vLLM wants <5.0). Dropped vLLM for now, using native HF generation.
104
  - Training running on Northflank H100 with QLoRA + gradient checkpointing.
105
  - Next: confirm training completes, check reward progression, update docs.
 
15
  - include decisions, blockers, and concrete next steps
16
  - summarize command/test results instead of pasting long raw output
17
 
18
+ ## 2026-03-08 01:05 PST
19
+
20
+ - Upgraded the local Codex swarm prompt in [train/codex_swarm.py](../train/codex_swarm.py) from generic lanes to explicit specialist researcher-implementer roles.
21
+ - Workers now receive a local research pack before patching: `AGENTS.md`, `README.md`, the swarm plan, benchmark scripts, the current champion snapshot, the swarm ledger, and accepted historical winners copied into each worker sandbox.
22
+ - Kept the editable surface narrow at `src/zero960/workspace_template/eval.py` so promotion still measures one variable cleanly, while making accepted history visible so workers can differentiate instead of repeating the same rewrite.
23
+ - Updated [README.md](../README.md) and [docs/codex-swarm-plan.md](./codex-swarm-plan.md) to match the new role-based swarm shape.
24
+
25
+ ## 2026-03-08 01:20 PST
26
+
27
+ - Expanded the default local swarm from 4 to 5 workers by adding an `Initiative Tuner` role in [train/codex_swarm.py](../train/codex_swarm.py).
28
+ - Added continuous swarm mode with `--continuous`, `--max-stall-rounds`, and `--sleep-sec` so the coordinator can keep running promotion rounds until interrupted or until it stalls.
29
+ - Kept promotion eval-focused because the current benchmark path only measures `eval.py` cleanly; search edits still need a separate promotion harness before they should be opened up.
30
+ - Updated [README.md](../README.md) and [docs/codex-swarm-plan.md](./codex-swarm-plan.md) with the 5-worker defaults and the long-running loop command.
31
+
32
+ ## 2026-03-08 01:35 PST
33
+
34
+ - Found and fixed a coordinator bug in [train/codex_swarm.py](../train/codex_swarm.py): worker sandboxes were copying the repo `workspace_template/eval.py` instead of overwriting it with the frozen swarm champion before Codex started.
35
+ - Stopped the invalid live five-worker loop, patched `_sync_worker_snapshot()` to copy `outputs/codex_swarm/champion_eval.py` into each worker's editable `eval.py`, and prepared to restart the loop from a valid champion snapshot.
36
+ - Confirmed this also explains why all five workers initially converged on the same hash despite the new specialist-role prompts.
37
+
38
+ ## 2026-03-08 01:50 PST
39
+
40
+ - Added a separate search-safe benchmark harness in [train/benchmark_engine.py](../train/benchmark_engine.py).
41
+ - This harness benchmarks two full engine roots against each other, loading each side's own `select_move()` from its own `src/zero960/engine/search.py` plus its own eval file, instead of sharing the live repo search module.
42
+ - Kept the main swarm promotion gate unchanged for now; this new harness is the prerequisite for later opening `search.py` edits without corrupting head-to-head comparisons.
43
+
44
  ## 2026-03-07 17:10 PST
45
 
46
  - Read the initial project docs and collapsed the scope from a broad research wishlist into a narrow hackathon MVP.
 
129
  - Also hit vLLM 0.17 vs transformers 5.3 incompatibility (vLLM wants <5.0). Dropped vLLM for now, using native HF generation.
130
  - Training running on Northflank H100 with QLoRA + gradient checkpointing.
131
  - Next: confirm training completes, check reward progression, update docs.
132
+
133
+ ## 2026-03-08 09:30 PST
134
+
135
+ - Inspected rollout logs and confirmed the failure mode was policy-level rather than purely a matter of model size: the agent kept choosing `run_static_eval` or early `finish` and rarely attempted a code edit.
136
+ - Tightened the runtime reward shaping around the intended workflow in `src/zero960/runtime/episode.py`: valid changed writes now get an immediate bonus, explicit `run_match` after a write is rewarded, repeated `run_static_eval` and wasted `read_file` calls are penalized, and finishing without an edit or explicit match is penalized.
137
+ - Changed write handling so `write_file` validates `eval.py` immediately by loading `evaluate(board)` and rolls back invalid edits instead of leaving the episode in a broken workspace.
138
+ - Extended observations with workflow hints and suggested next actions so the policy sees explicit guidance like "write first" and "run_match next" after each step.
139
+ - Reworked `train/minimal_trl_openenv.py` prompt instructions to state that `eval.py` is already visible, show the preferred `write_file -> run_match -> finish` sequence, and reduced completion length from 512 to 256 to bias toward compact JSON outputs.
140
+ - Replaced the brittle regex JSON parser with a brace-balanced extractor so `write_file` actions containing nested braces in Python code are more likely to parse correctly.
141
+ - Next: run fresh `infer` and short GRPO checks to see whether the action distribution shifts from `run_static_eval`/`finish` toward `write_file`.
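A brace-balanced extractor of the kind described above can be sketched as follows (a simplified stand-in, not the trainer's exact parser):

```python
import json

def extract_first_json_object(text: str):
    """Hypothetical sketch of a brace-balanced JSON extractor: find the
    first balanced {...} span, tracking string literals and escapes, so
    nested braces inside Python code payloads don't break parsing."""
    start = text.find("{")
    while start != -1:
        depth = 0
        in_string = False
        escape = False
        for i in range(start, len(text)):
            ch = text[i]
            if in_string:
                if escape:
                    escape = False
                elif ch == "\\":
                    escape = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start : i + 1])
                    except json.JSONDecodeError:
                        break  # unbalanced-looking span; try the next "{"
        start = text.find("{", start + 1)
    return None
```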
142
+
143
+ ## 2026-03-08 11:10 PST
144
+
145
+ - Added `train/codex_distill.py`, a teacher-data collection path that runs Codex through the same bounded Zero960 action schema and writes both raw rollout traces and SFT-ready chat samples.
146
+ - Kept the teacher constrained to one JSON action per turn with a strict output schema, so the collected data matches the student policy interface instead of leaking shell/editor tool use.
147
+ - Simplified the top-level docs around the current strategy: distillation first, RL refinement second.
148
+ - Deleted redundant planning/research docs that mostly restated old RL-first assumptions and rewrote the README to point at the active entrypoints and docs that still matter.
149
+ - Added generated training artifacts to `.gitignore` so local and remote runs stop cluttering the worktree.
150
+ - Next: run a first short Codex teacher collection against the live env and inspect how many traces survive the reward filter.
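For reference, one SFT-ready chat sample in this shape might look like the following (field names are illustrative, not the exact export format):

```python
import json

# Hypothetical sketch of one SFT-ready chat row in the bounded schema:
# the assistant turn is exactly one JSON action, matching the student
# policy interface rather than free-form shell/editor tool use.
sample = {
    "messages": [
        {"role": "system", "content": "Improve eval.py using bounded JSON actions."},
        {"role": "user", "content": "<observation: current eval.py and last match result>"},
        {
            "role": "assistant",
            "content": json.dumps({
                "action": "write_file",
                "path": "eval.py",
                "content": "def evaluate(board):\n    return 0\n",
            }),
        },
    ]
}
```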
151
+
152
+ ## 2026-03-08 21:20 PST
153
+
154
+ - Added `train/sft_student.py`, a minimal student fine-tuning entrypoint that reads the exported `sft_samples_*.jsonl` files, validates the assistant action payloads, drops malformed legacy rows, deduplicates identical chats, and trains with TRL `SFTTrainer`.
155
+ - Kept the dataset conversational and enabled `assistant_only_loss` so the student is optimized on the teacher’s bounded JSON action turn, not on reproducing the prompt text.
156
+ - Added a dry-run mode and basic dataset stats so the repo can inspect the current teacher corpus before launching a real training job.
157
+ - Updated the README with the new student-SFT command, keeping the distill-first flow explicit.
158
+ - Next: run a dry-load smoke test locally, then train a first 0.8B student checkpoint and compare `infer` behavior against the base model.
159
+
160
+ ## 2026-03-08 22:05 PST
161
+
162
+ - Re-read the project docs and aligned the next work with the intended claim: reward should reflect downstream Chess960 engine strength, not only loop compliance.
163
+ - Replaced the toy default eval with a stronger Chess960-safe heuristic in both `src/zero960/engine/default_eval.py` and `src/zero960/workspace_template/eval.py`, adding pawn-structure, mobility, center control, rook-file, king-safety, castling-rights, bishop-pair, and development terms.
164
+ - Added simple move ordering to `src/zero960/engine/search.py` so shallow alpha-beta spends more time on captures, checks, promotions, and castling moves.
165
+ - Updated the deterministic training write path in `train/minimal_trl_openenv.py` to make small valid edits against the new eval constants instead of the old toy eval body.
166
+ - Added `train/benchmark_eval.py` plus a README command so candidate eval files can be compared against a baseline on held-out Chess960 start positions with an estimated Elo delta.
167
+ - Next: run the benchmark on the H100 against saved candidate evals and use that metric, not shaped reward alone, to judge whether future training actually improves play.
168
+
169
+ ## 2026-03-08 22:40 PST
170
+
171
+ - Ran the first remote student SFT job on the Northflank H100 against the merged teacher corpus (`105` clean rows / `35` successful episodes); the job finished cleanly in about `5m 11s`.
172
+ - Final SFT metrics on the remote run were strong for this narrow dataset: train loss `0.2072`, eval loss `0.04192`, eval token accuracy `0.9876`.
173
+ - Compared `infer` behavior on the H100: base `Qwen/Qwen3.5-0.8B` still spammed `run_static_eval` for all six steps and ended at reward `-2.1`, while the SFT checkpoint executed the intended `write_file -> run_match -> finish` loop in three steps for reward `1.0`.
174
+ - Updated `train/minimal_trl_openenv.py` so infer mode can accept a separate tokenizer path when evaluating checkpoints that do not bundle tokenizer files at the model root.
175
+ - Next: run a small batched eval of base vs SFT student across multiple episodes and then decide whether to add GRPO refinement or collect more teacher data first.
176
+
177
+ ## 2026-03-08 23:05 PST
178
+
179
+ - Wrote down the new higher-level direction in `docs/codex-swarm-plan.md`: use multiple Codex workers on the H100 as a champion/challenger engine-iteration loop, benchmark every candidate, and only keep Elo-positive patches.
180
+ - Kept the repo story explicit: OpenEnv remains the core environment and submission artifact, while the Codex swarm acts as an outer optimization layer for discovering stronger engine code and better teacher traces.
181
+ - Updated the README to link this plan so the project direction is visible without digging through chat history.
182
+ - Next: finish Codex CLI auth on the H100, create isolated worker worktrees, and start with a small `eval.py`-only worker swarm before broadening the editable surface.
183
+
184
+ ## 2026-03-08 23:20 PST
185
+
186
+ - Simplified the Codex swarm plan: local Codex workers are now the default orchestration path, while the H100 is treated as an optional heavy-compute box for larger benchmarks and training.
187
+ - Updated `docs/codex-swarm-plan.md` to reflect the practical setup that avoids remote Node/npm bootstrap friction and keeps the worker loop easier to debug.
188
+ - Updated the README wording so the Codex outer loop is clearly described as a local worker swarm rather than an H100-hosted agent farm.
189
+ - Next: finish local Codex auth, create 3 local worker worktrees, and start with an `eval.py`-only champion/challenger loop.
190
+
191
+ ## 2026-03-08 23:40 PST
192
+
193
+ - Added `train/codex_swarm.py`, a runnable local coordinator for the new champion/challenger loop instead of leaving the swarm idea only in docs.
194
+ - The coordinator initializes a champion eval snapshot under `outputs/codex_swarm/champion_eval.py`, spins up worker sandboxes under `/tmp/0x960-codex-swarm/`, and runs one Codex worker per sandbox against the same frozen champion each round.
195
+ - Worker setup now tries `git worktree add` first and falls back to a lightweight local `git clone --shared` when `.git/worktrees` cannot be written, which makes the swarm usable in stricter local sandboxes too.
196
+ - Round execution writes prompts, Codex stdout/stderr, final summaries, and per-worker JSON results under `outputs/codex_swarm/runs/`, then promotes only the best challenger whose held-out score beats the configured threshold.
197
+ - Refactored `train/benchmark_eval.py` into a reusable library surface with `benchmark_eval_files(...)` plus a structured `BenchmarkResult`, so the CLI benchmark and swarm coordinator share the same evaluation logic.
198
+ - Smoke-tested the new entrypoints locally with `py_compile`, `uv run python -m train.codex_swarm setup --workers 2`, `uv run python -m train.codex_swarm run --workers 2 --rounds 1 --dry-run --serial`, and `uv run python -m train.codex_swarm status`.
199
+ - Next: run a live local Codex round, inspect the first real challenger diffs and benchmark scores, then decide whether to broaden the editable surface beyond `eval.py`.
200
+
201
+ ## 2026-03-08 23:58 PST
202
+
203
+ - Added `train/benchmark_uci.py`, a separate UCI benchmark entrypoint for anchoring the local eval/search engine against external engines like Stockfish under fixed Chess960 start positions.
204
+ - The new harness loads a local `eval.py`, plays both colors against a UCI engine, and reports wins, draws, losses, score, and an Elo-style delta estimate so the demo can show both relative improvement and an external anchor.
205
+ - Documented the new Stockfish-style benchmark command in `README.md` alongside the existing baseline-vs-candidate benchmark flow.
206
+ - Smoke-tested the new entrypoint locally with `python3 -m py_compile train/benchmark_uci.py` and `uv run python -m train.benchmark_uci --help`.
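An Elo-style delta is typically derived from a match score via the logistic Elo model; a sketch, which may differ from the harness's exact formula:

```python
import math

def elo_delta_from_match(points: float, games: int) -> float:
    """Hypothetical sketch of an Elo-style delta estimate: map the match
    score fraction through the logistic Elo model, clamping away from 0%
    and 100% so small samples never produce infinite deltas."""
    frac = points / games
    frac = min(max(frac, 1.0 / (2 * games)), 1.0 - 1.0 / (2 * games))
    return -400.0 * math.log10(1.0 / frac - 1.0)
```

A 50% score maps to a delta of 0; scores above 50% give a positive delta and below 50% a negative one.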
207
+
208
+ ## 2026-03-09 00:09 PST
209
+
210
+ - Extended `train/benchmark_uci.py` with repeated `--engine-option NAME=VALUE` support so the repo can run calibrated UCI anchors such as `UCI_LimitStrength=true` and `UCI_Elo=1320` instead of only raw depth-based Stockfish tests.
211
+ - Installed `stockfish` locally and ran the first rough external ladder against the best current challenger from the Codex swarm.
212
+ - On a small `4`-position / `8`-game sample, the worker-1 patch scored `4.5/8` against `stockfish` with `UCI_Elo=1320`, then `2.0/8` against both `UCI_Elo=1600` and `UCI_Elo=1800`; this is noisy but enough to bracket the current engine above the weakest anchor and below the stronger ones.
213
+ - Updated the README example to show the calibrated Stockfish option flow rather than only raw `engine-depth`.
214
+
215
+ ## 2026-03-09 00:20 PST
216
+
217
+ - Tightened `train/codex_swarm.py` for faster live rounds instead of just adding more undirected agents: the default worker count is now `4`, each worker gets an explicit heuristic lane, and the coordinator enforces a per-worker timeout.
218
+ - The default lanes now spread the first wave across king safety, loose-piece/tactical pressure, piece activity, and pawn/rook structure so four Codex workers do not all rediscover the same generic positional patch.
219
+ - Updated the worker prompt so each agent is explicitly told to make one bounded patch, run one final benchmark, and stop. This should keep rounds short enough to iterate like a real champion/challenger loop.
220
+ - Updated `README.md` and `docs/codex-swarm-plan.md` to use the 4-worker setup and document the new `--worker-timeout-sec` flow.
221
+ - Smoke-tested the coordinator changes with `python3 -m py_compile train/codex_swarm.py train/benchmark_uci.py train/benchmark_eval.py`.
222
+
223
+ ## 2026-03-09 00:34 PST
224
+
225
+ - Added `train/benchmark_league.py`, a new league-style self-play benchmark that evaluates one candidate against the original baseline plus the accepted swarm champion history instead of only one current baseline.
226
+ - The default league builder pulls from `outputs/codex_swarm/accepted/`, includes the original baseline, skips the candidate itself, and also skips any accepted snapshot whose contents are byte-identical to the candidate so the league does not accidentally include a mirror match.
227
+ - Smoke-tested the new script with `python3 -m py_compile train/benchmark_league.py`, `uv run python -m train.benchmark_league --help`, and a tiny real run at `--positions 4`.
228
+ - That sample run showed the current champion splitting the small league overall: strong against the original baseline, weaker against the older accepted worker-1 snapshot, and neutral overall on the combined pool.
229
+ - Updated the README with the new league benchmark command so the self-play path is visible next to the existing head-to-head and Stockfish anchor commands.
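The pool-building rule can be sketched as follows (hypothetical helper, assuming accepted snapshots are `.py` files):

```python
import hashlib
from pathlib import Path

def build_league_pool(accepted_dir: Path, baseline: Path, candidate: Path) -> list[Path]:
    """Hypothetical sketch of the league pool rule: include the original
    baseline and accepted snapshots, but skip the candidate itself and
    any snapshot byte-identical to it (no mirror matches)."""
    def digest(p: Path) -> str:
        return hashlib.sha256(p.read_bytes()).hexdigest()

    cand = digest(candidate)
    pool = [baseline] if digest(baseline) != cand else []
    for snap in sorted(accepted_dir.glob("*.py")):
        if snap.resolve() != candidate.resolve() and digest(snap) != cand:
            pool.append(snap)
    return pool
```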
230
+
231
+ ## 2026-03-09 00:45 PST
232
+
233
+ - Added `train/build_dashboard.py`, a static dashboard generator that reads the swarm ledger, current champion, accepted history, league benchmark, and optional Stockfish anchors, then writes a self-contained `outputs/dashboard/index.html` plus `outputs/dashboard/dashboard_data.json`.
234
+ - The generated page visualizes accepted-champion progression, internal Elo deltas, recent swarm results, league self-play rows, and Stockfish anchor bars without needing a frontend framework or a running web server.
235
+ - Fixed the default dashboard pool so it skips accepted snapshots that are byte-identical to the current champion instead of showing a misleading mirror match.
236
+ - Smoke-tested the generator with `python3 -m py_compile train/build_dashboard.py`, `uv run python -m train.build_dashboard`, and a full `uv run python -m train.build_dashboard --include-stockfish`.
237
+ - Updated the README with the new dashboard command and output paths so the visualization can be regenerated after each swarm round.
238
+
239
+ ## 2026-03-09 00:09 PST
240
+
241
+ - Tightened `train/codex_swarm.py` so worker prompts now explicitly require surgical edits via `apply_patch` and call out a hard diff-size budget instead of loosely asking for “small” changes.
242
+ - Added `--max-diff-lines` to the swarm CLI, defaulting to `80`, and recorded added/deleted diff counts in each worker result so whole-file rewrites are visible in the ledger.
243
+ - Promotion/acceptance now requires both `benchmark.score > min_score` and `diff_lines_added + diff_lines_deleted <= max_diff_lines`, which stops noisy 250-line rewrites from winning by a tiny margin.
244
+ - Updated `README.md` and `docs/codex-swarm-plan.md` to use the new `--max-diff-lines 80` flag in the standard swarm commands.
245
+ - Smoke-tested the coordinator change with `python3 -m py_compile train/codex_swarm.py` and `uv run python -m train.codex_swarm run --workers 2 --rounds 1 --dry-run --serial --max-diff-lines 40`.
246
+
247
+ ## 2026-03-09 00:09 PST
248
+
249
+ - Checked the official Codex docs and aligned `train/codex_swarm.py` to the best practices that are actually supported by the installed CLI on this box.
250
+ - Added a per-worker root `AGENTS.override.md` so the “surgical patch only / no whole-file rewrite / one probe max / one final benchmark” constraints live in Codex’s native instruction channel instead of only in the prompt body.
251
+ - Kept workers on sandboxed automatic execution, but disabled web search with `-c 'web_search="disabled"'` so the swarm stays local and reproducible.
252
+ - Switched worker prompts from giant argv strings to stdin (`codex exec ... -`), which keeps process listings readable and avoids shoving long prompts into the command line.
253
+ - Enabled `--ephemeral` and `--json` for worker execs so automation runs stay stateless and stdout captures machine-readable Codex events for debugging.
254
+ - Verified that `npm install -g @openai/codex@latest` still resolves to `@openai/codex@0.111.0`; this box is already on the newest npm-published CLI, and that version does not support the newer `--ask-for-approval` flag from the docs.
255
+ - Smoke-tested the updated coordinator with `python3 -m py_compile train/codex_swarm.py` and `uv run python -m train.codex_swarm run --workers 1 --rounds 1 --dry-run --serial --max-diff-lines 40`.
256
+
257
+ ## 2026-03-09 00:22 PST
258
+
259
+ - Tightened the swarm loop again after seeing five workers converge on near-identical 250-300 line rewrites without finishing promptly.
260
+ - Changed `train/codex_swarm.py` so Codex workers no longer run `train.benchmark_eval` themselves. They now research, patch `eval.py`, optionally do one tiny local sanity check, and stop.
261
+ - Moved the expensive held-out benchmark fully into the coordinator path and made the coordinator skip benchmarking entirely when a worker exceeds the `--max-diff-lines` budget.
262
+ - Updated `README.md` and `docs/codex-swarm-plan.md` to reflect the new control flow: Codex proposes, coordinator benchmarks, promotion stays centralized.
263
+ - Smoke-tested the refactor with `python3 -m py_compile train/codex_swarm.py` and `uv run python -m train.codex_swarm run --workers 1 --rounds 1 --dry-run --serial --max-diff-lines 40`.
264
+
265
+ ## 2026-03-09 00:31 PST
266
+
267
+ - Switched the swarm default model from `gpt-5.4` to `gpt-5.3-codex` after inspecting live worker diffs and seeing GPT-5.4 repeatedly collapse into near-identical 300-line eval rewrites.
268
+ - Updated the standard swarm commands in `README.md` and `docs/codex-swarm-plan.md` to use `gpt-5.3-codex` as the preferred local worker model.
269
+ - Next: run a single bounded `gpt-5.3-codex` wave, inspect the raw diffs directly, and only restore continuous mode if the patches become smaller and more diverse.
270
+
271
+ ## 2026-03-09 00:42 PST
272
+
273
+ - Refactored `outputs/codex_swarm/champion_eval.py` into explicit swarm hook lanes: `_structure_hook`, `_tactical_hook`, `_activity_hook`, `_pawn_endgame_hook`, and `_initiative_hook`.
274
+ - Preserved the prior champion behavior by wrapping the existing extra heuristics inside the new hook functions instead of changing the score formula itself.
275
+ - Updated `train/codex_swarm.py` so each worker role is now bound to one named hook rather than a vague lane description. Prompts and `AGENTS.override.md` now tell workers to edit only their assigned hook body.
276
+ - Verified the refactor with `python3 -m py_compile train/codex_swarm.py outputs/codex_swarm/champion_eval.py`, a dry-run coordinator pass, and a quick old-vs-new champion benchmark: `score=0.500` over `8` games.
277
+
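The hook-lane split described above can be sketched as a base score plus one bounded term per lane. The function names match the lanes named in the log, but the bodies here are empty placeholders and `evaluate_with_hooks` is a hypothetical wrapper, not the project's actual evaluator:

```python
def _structure_hook(info):
    return 0  # placeholder lane: pawn-structure terms go here


def _tactical_hook(info):
    return 0  # placeholder lane: tactical-pressure terms go here


def _activity_hook(info):
    return 0  # placeholder lane: piece-activity terms go here


def _pawn_endgame_hook(info):
    return 0  # placeholder lane: pawn-endgame terms go here


def _initiative_hook(info):
    return 0  # placeholder lane: initiative/tempo terms go here


HOOKS = (_structure_hook, _tactical_hook, _activity_hook,
         _pawn_endgame_hook, _initiative_hook)


def evaluate_with_hooks(base_score, info):
    """Final score = base evaluation plus one bounded term per swarm lane."""
    return base_score + sum(hook(info) for hook in HOOKS)
```

Binding each worker to exactly one hook body keeps diffs localized: a worker can only move the score through its own lane.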
278
+ ## 2026-03-09 01:06 PST
279
+
280
+ - Found a bug in the swarm diff gate: it was measuring candidate changes against repo `HEAD` instead of the frozen champion snapshot copied into each worker sandbox, which falsely made every worker look like a 300-line rewrite.
281
+ - Fixed `train/codex_swarm.py` to compute diff counts against the pre-run snapshot, not the git checkout below it.
282
+ - Fixed the worker-timeout path so `subprocess.TimeoutExpired` stdout/stderr bytes are decoded cleanly and timed-out workers still return structured results instead of crashing the coordinator.
283
+ - Ran a short hook-targeted `gpt-5.3-codex` probe and confirmed the new structure works: the worker produced a localized patch only inside `_structure_hook` rather than rewriting the evaluator.
284
+ - The first localized patch added king-shield, pawn-storm, and Chess960 castled-structure terms inside `_structure_hook`; benchmark measurement is slower than the interactive loop, but the swarm behavior is finally aligned with the intended patch surface.
285
+
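Measuring the patch budget against the frozen snapshot rather than the git checkout can be done with stdlib `difflib`. This is a minimal sketch of the idea, not the coordinator's actual code:

```python
import difflib


def changed_line_count(snapshot_text: str, candidate_text: str) -> int:
    """Count added/removed lines between the frozen snapshot and a candidate.

    Comparing against the snapshot (not repo HEAD) is the fix described
    above: the budget should measure what the worker actually changed.
    """
    diff = difflib.unified_diff(
        snapshot_text.splitlines(), candidate_text.splitlines(), lineterm=""
    )
    return sum(
        1
        for line in diff
        if (line.startswith("+") or line.startswith("-"))
        and not line.startswith(("+++", "---"))  # skip the file headers
    )
```

A one-line edit then counts as 2 changed lines (one removed, one added), well under a `--max-diff-lines 40` budget.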
286
+ ## 2026-03-09 01:15 PST
287
+
288
+ - Added a staged benchmark funnel to `train/codex_swarm.py` so workers no longer all pay for the full held-out benchmark.
289
+ - New flow: every eligible patch gets a cheap screen benchmark (`--screen-positions`, default `8`), then only the best screen winner gets the heavier final benchmark (`--positions`, now the final-stage sample count).
290
+ - Added `--screen-positions` and `--screen-min-score` CLI flags; the default fast path is now `8` positions for screening and `16` for the final promotion check.
291
+ - Reduced the recommended worker timeout in the docs from `600s` to `180s` because workers now only patch and return, not benchmark locally.
292
+ - Smoke-tested the updated coordinator with `python3 -m py_compile train/codex_swarm.py`, `uv run python -m train.codex_swarm run --workers 1 --rounds 1 --dry-run --serial --screen-positions 4 --positions 8 --max-diff-lines 40`, and `uv run python -m train.codex_swarm run --help`.
293
+
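The screen-then-final funnel reduces to a small selection routine. `screen_fn` and `final_fn` are hypothetical stand-ins for the cheap screen benchmark and the heavier held-out benchmark:

```python
def staged_promotion(candidates, screen_fn, final_fn, screen_min=0.5):
    """Screen every candidate cheaply; only the best survivor pays for
    the expensive final benchmark."""
    screened = [(screen_fn(c), c) for c in candidates]
    eligible = [(s, c) for s, c in screened if s >= screen_min]
    if not eligible:
        return None                      # nobody earns the final benchmark
    best_screen, best = max(eligible, key=lambda sc: sc[0])
    return best, final_fn(best)          # exactly one heavy benchmark per round
```

With N workers this runs N cheap screens but at most one expensive match, which is the cost profile the log entry is after.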
294
+ ## 2026-03-09 01:28 PST
295
+
296
+ - The first hook-targeted screened round produced a real promotion: `worker-2` patched `_tactical_hook`, screened at `0.656` over `16` games, and held `0.578` over `32` games for an estimated `+54.7 Elo` versus the previous champion.
297
+ - Promoted that tactical hook patch into `outputs/codex_swarm/champion_eval.py` and saved the accepted snapshot as `outputs/codex_swarm/accepted/20260308T092035Z_worker-2_eval.py`.
298
+ - Tightened the round scheduler so it now reads the current champion and prioritizes underdeveloped hook lanes automatically: empty hooks (`return 0`) first, then simple passthrough hooks (`return _base_*`), then already-customized hooks last.
299
+ - Reordered the default worker specializations so the fast three-worker wave now naturally targets `structure`, `pawn_endgame`, and `initiative` once the tactical hook is already carrying custom logic.
300
+ - Smoke-tested the prioritizer with `python3 -m py_compile train/codex_swarm.py`, `uv run python -m train.codex_swarm run --workers 3 --rounds 1 --dry-run --serial --screen-positions 4 --positions 8 --worker-timeout-sec 60`, and a direct hook-state probe under `uv run python -`.
301
+
302
+ ## 2026-03-09 01:54 PST
303
+
304
+ - Multiple follow-up eval-only hook waves regressed on held-out screens: recent `_structure_hook`, `_pawn_endgame_hook`, and `_initiative_hook` candidates all scored below the current tactical-hook champion on fresh `train.benchmark_eval` probes.
305
+ - Concluded that the fastest path to a larger jump was no longer eval stacking but classical search quality, since `src/zero960/engine/search.py` was still a bare fixed-depth negamax with no quiescence or transposition memory.
306
+ - Upgraded `src/zero960/engine/search.py` with:
307
+ - quiescence search at depth-0 leaves (captures and promotions only),
308
+ - transposition-table probe/store using `board._transposition_key()`,
309
+ - killer-move and history-heuristic move ordering on quiet moves.
310
+ - Snapshotted the pre-change search/eval pair into `/tmp/0x960-search-baseline/` and benchmarked the new search against it with `train.benchmark_engine`.
311
+ - Internal engine-vs-engine results were dramatic:
312
+ - `positions=2`: `4.0/4`, `score=1.000`
313
+ - `positions=4`: `8.0/8`, `score=1.000`
314
+ - `positions=8`: `15.5/16`, `score=0.969`, estimated `+596.5 Elo`
315
+ - External anchor also improved sharply under the upgraded search:
316
+ - `uv run python -m train.benchmark_uci --candidate-file outputs/codex_swarm/champion_eval.py --engine-command stockfish --engine-option UCI_LimitStrength=true --engine-option UCI_Elo=1320 --positions 8 --candidate-depth 2 --engine-depth 1 --max-plies 120 --seed 42`
317
+ - result: `12.5/16`, `score=0.781`, estimated `+221.1 Elo` versus the `1320` anchor in this local setup.
318
+
319
+ ## 2026-03-09 03:08 PST
320
+
321
+ - Extended `train/codex_swarm.py` with a second swarm surface: `--surface search`.
322
+ - Search-mode workers now edit only `src/zero960/engine/search.py`, targeting one named search function per worker:
323
+ - `_move_order_score`
324
+ - `_quiescence`
325
+ - `negamax`
326
+ - `select_move`
327
+ - `_tactical_moves`
328
+ - Added `outputs/codex_swarm/champion_search.py` as the frozen swarm search baseline, parallel to `champion_eval.py`.
329
+ - The coordinator now snapshots a per-round baseline engine root and uses `train.benchmark_engine` for search-surface promotion, so each side gets its own eval plus its own searcher during held-out matches.
330
+ - Added benchmark timeout support to `train/codex_swarm.py` via `--benchmark-timeout-sec` so pathological search patches can be rejected instead of stalling the whole swarm.
331
+ - Updated `train/build_dashboard.py` to support `--include-engine-progress`, which benchmarks the current champion eval plus current repo search against `/tmp/0x960-search-baseline` and exposes that result in `dashboard_data.json` / `index.html`.
332
+ - Updated `README.md` and `docs/codex-swarm-plan.md` to document:
333
+ - the new search-surface swarm command
334
+ - the engine-progress dashboard command
335
+ - the difference between eval-surface and search-surface promotion
336
+ - Smoke-tested the new coordinator and dashboard code with:
337
+ - `python3 -m py_compile train/codex_swarm.py train/build_dashboard.py`
338
+ - `uv run python -m train.codex_swarm run --workers 2 --rounds 1 --dry-run --serial --surface search --screen-positions 2 --positions 4 --worker-timeout-sec 60 --benchmark-timeout-sec 30`
339
+
340
+ ## 2026-03-09 03:23 PST
341
+
342
+ - The first real search-surface Codex round produced clean small patches in `_move_order_score` and `_quiescence`, but both candidates timed out in the original search screen benchmark configuration.
343
+ - Tightened the search-surface coordinator path so search screening is now intentionally cheaper than eval screening:
344
+ - added `--search-screen-positions`
345
+ - added `--search-screen-depth`
346
+ - added `--search-screen-max-plies`
347
+ - added a separate `--final-benchmark-timeout-sec`
348
+ - Current intended search fast path is:
349
+ - cheap screen: `positions=1`, `depth=1`, `max_plies=20`
350
+ - final check: a slightly heavier engine-vs-engine match with its own timeout budget
351
+ - Fixed a worker snapshot refresh race in `_copy_tree()` by switching the pre-copy cleanup to `shutil.rmtree(..., ignore_errors=True)`, which avoids spurious `FileNotFoundError` failures when reusing local worker sandboxes under `/tmp/0x960-codex-swarm/`.
352
+ - Smoke-tested the cheaper search-screen path with:
353
+ - `python3 -m py_compile train/codex_swarm.py`
354
+ - `uv run python -m train.codex_swarm run --workers 2 --rounds 1 --dry-run --serial --surface search --search-screen-positions 1 --search-screen-depth 1 --search-screen-max-plies 20 --positions 4 --depth 2 --max-plies 120 --worker-timeout-sec 60 --benchmark-timeout-sec 20 --final-benchmark-timeout-sec 60`
355
+ - Direct fast-screen probes against the earlier search candidates finally returned promptly at `max_plies=20`:
356
+ - move-ordering patch: `score=0.500` over `2` games
357
+ - quiescence patch: `score=0.500` over `2` games
358
+ - That is not enough to claim improvement, but it proves the search-surface screen is now operational instead of timing out by default.
359
+ - The current engine also held up better than the earlier rough anchor read against a bigger `Stockfish UCI_Elo=1600` sample:
360
+ - `uv run python -m train.benchmark_uci --candidate-file outputs/codex_swarm/champion_eval.py --engine-command stockfish --engine-option UCI_LimitStrength=true --engine-option UCI_Elo=1600 --positions 4 --candidate-depth 2 --engine-depth 1 --max-plies 120 --seed 42`
361
+ - result: `4.5/8`, `score=0.5625`, estimated `+43.7 Elo` versus that local `1600` anchor setting
362
+ - Loosened the search-surface screen gate in `train/codex_swarm.py` so neutral search screens (`score == threshold`) can still advance to one heavier final benchmark. The ultra-fast `2`-game search screen is too coarse to treat `0.500` as automatic rejection.
363
+
364
+ ## 2026-03-09 04:14 PST
365
+
366
+ - Stopped waiting on slow full-match probes and moved to faster direct checks on the already-strong search baseline.
367
+ - Added selective root deepening to `src/zero960/engine/search.py` and synced the same change into `outputs/codex_swarm/champion_search.py`:
368
+ - when the root is in check,
369
+ - or the root move count is small (`<= 12`),
370
+ - or the game is in a low-material endgame with moderate branching,
371
+ - `select_move(..., depth=2)` now searches one extra ply at the root instead of paying for full-time `depth=3`.
372
+ - Timing sanity checks on the current champion eval with the new searcher:
373
+ - opening roots at nominal `depth=2` stayed fast (`~0.07s` to `0.11s` on three sampled Chess960 starts),
374
+ - a short 10-ply sample game mostly stayed under `~1.0s` per move, with a few heavier later plies around `1.0s`,
375
+ - full `depth=3` remained much slower (`~1.3s` to `1.5s` opening roots, growing to multi-second later plies), so selective root deepening is the better trade for now.
376
+ - Quick engine checks on the selective-depth searcher:
377
+ - internal engine-vs-engine smoke test against `/tmp/0x960-search-baseline` with `positions=1`, `depth=2`, `max_plies=80` still swept `2/2` games.
378
+ - local anchor smoke test against `Stockfish UCI_Elo=1600` with `positions=1`, `candidate_depth=2`, `engine_depth=1`, `max_plies=80` also scored `2/2` games.
379
+ - Synced the measured best eval surface into `src/zero960/workspace_template/eval.py` so the actual environment workspace now matches `outputs/codex_swarm/champion_eval.py` instead of lagging behind the swarm champion.
380
+
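The selective root-deepening rule above is just a depth policy at the root. A sketch with hypothetical boolean inputs in place of the real board queries:

```python
def root_depth(nominal: int, in_check: bool, legal_moves: int,
               low_material: bool, moderate_branching: bool) -> int:
    """One extra root ply for tactical or narrow roots, per the rule above."""
    if in_check or legal_moves <= 12 or (low_material and moderate_branching):
        return nominal + 1
    return nominal
```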
381
+ ## 2026-03-09 04:28 PST
382
+
383
+ - Fixed a real search-quality bug in both `src/zero960/engine/search.py` and `outputs/codex_swarm/champion_search.py`: quiescence no longer uses stand-pat when the side to move is in check, and it now searches all legal evasions in that case instead of only tactical captures.
384
+ - Timing sanity after the in-check quiescence fix stayed healthy on sampled Chess960 openings at nominal `depth=2`:
385
+ - `0.099s`, `0.058s`, and `0.069s` on three sampled starts.
386
+ - Filled the previously empty `_structure_hook` in both `src/zero960/workspace_template/eval.py` and `outputs/codex_swarm/champion_eval.py` with conservative pawn-coordination terms:
387
+ - connected pawns,
388
+ - pawn chains,
389
+ - central pawn duos,
390
+ - modest bonuses for advanced central pawns,
391
+ - all phase-weighted so they matter in the middlegame without distorting late endgames.
392
+ - Avoided further king-safety duplication in that hook; the new structure terms are intended to complement the existing tactical/activity hooks rather than re-score the same shelter signals.
393
+
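The in-check quiescence fix can be illustrated over an abstract position type. Every callback here (`evaluate`, `in_check`, `tactical_moves`, `legal_moves`, `play`) is a hypothetical stand-in for the real board API, and scores follow the side-to-move negamax convention:

```python
def quiescence(pos, alpha, beta, *, evaluate, in_check, tactical_moves,
               legal_moves, play):
    """Quiescence with the fix described above: no stand-pat while in
    check, and all legal evasions are searched, not just captures."""
    if in_check(pos):
        moves = list(legal_moves(pos))     # every evasion
        best = -10**9                      # no stand-pat while in check
    else:
        stand_pat = evaluate(pos)
        if stand_pat >= beta:
            return stand_pat               # static score already refutes
        alpha = max(alpha, stand_pat)
        moves = list(tactical_moves(pos))  # captures / promotions only
        best = stand_pat
    for m in moves:
        score = -quiescence(play(pos, m), -beta, -alpha, evaluate=evaluate,
                            in_check=in_check, tactical_moves=tactical_moves,
                            legal_moves=legal_moves, play=play)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break
    return best
```

The key behavioral difference is that an in-check node can never "opt out" of the threat by taking the static evaluation.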
394
+ ## 2026-03-09 04:36 PST
395
+
396
+ - Added a persistent module-level transposition table in both `src/zero960/engine/search.py` and `outputs/codex_swarm/champion_search.py` instead of rebuilding the TT from scratch on every `select_move` call.
397
+ - Also started using the stored TT best move at the root for move ordering.
398
+ - This is a classical engine improvement rather than a prompt/surface change: later moves in the same game can now reuse earlier search work.
399
+ - Short same-game timing probe on Chess960 start `123` at nominal `depth=2` improved substantially versus the earlier selective-depth-only version:
400
+ - early plies dropped to roughly `0.05s` to `0.10s`,
401
+ - later mid-opening plies stayed around `0.32s` to `0.63s`,
402
+ - compared to the prior selective-depth run where similar later plies were around `0.62s` to `1.03s`.
403
+ - Kept the selective root deepening path in place, so the current searcher now combines:
404
+ - quiescence,
405
+ - TT probe/store,
406
+ - persistent TT reuse across moves,
407
+ - killer/history ordering,
408
+ - selective one-ply root extensions in tactical / low-branching roots.
409
+
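A persistent transposition table reduces to a module-level dict keyed by position hash that simply survives across `select_move` calls. The entry layout below is illustrative, not the project's exact one:

```python
_TT: dict = {}  # module-level, so it persists across moves in the same game


def tt_probe(key, depth):
    """Return a stored entry only if it was searched at least as deep."""
    entry = _TT.get(key)
    if entry is not None and entry["depth"] >= depth:
        return entry
    return None


def tt_store(key, depth, score, best_move):
    """Depth-preferred replacement: never overwrite with a shallower search."""
    old = _TT.get(key)
    if old is None or depth >= old["depth"]:
        _TT[key] = {"depth": depth, "score": score, "move": best_move}
```

The stored `move` is what enables the root move-ordering reuse the entry mentions: probe the TT at the root and try that move first.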
410
+ ## 2026-03-09 04:44 PST
411
+
412
+ - Added principal variation search (PVS) to both `src/zero960/engine/search.py` and `outputs/codex_swarm/champion_search.py`:
413
+ - first move at a node is searched on the full window,
414
+ - later moves use a zero-window search first,
415
+ - only fail-high candidates get the full re-search.
416
+ - This is another classical-engine speed optimization on top of the earlier alpha-beta + TT stack.
417
+ - Same 10-ply timing probe on Chess960 start `123` at nominal `depth=2` improved again versus the TT-persistent version:
418
+ - later plies that had been around `0.32s` to `0.63s` came down to roughly `0.25s` to `0.46s`,
419
+ - opening plies stayed in the same healthy range (`~0.05s` to `0.11s`).
420
+ - Current search stack is now:
421
+ - alpha-beta negamax,
422
+ - quiescence with in-check evasions,
423
+ - TT probe/store,
424
+ - persistent TT reuse across moves,
425
+ - TT root move ordering,
426
+ - killer/history ordering,
427
+ - PVS,
428
+ - selective one-ply root extensions.
429
+
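PVS as described can be sanity-checked against plain alpha-beta on a toy game tree; with correct move handling both must return the same root value. Both functions here are illustrative sketches, not the engine's code:

```python
def negamax(node, depth, alpha, beta, children, evaluate):
    """Plain alpha-beta reference for comparison."""
    kids = children(node)
    if depth == 0 or not kids:
        return evaluate(node)
    best = -10**9
    for child in kids:
        best = max(best, -negamax(child, depth - 1, -beta, -alpha,
                                  children, evaluate))
        alpha = max(alpha, best)
        if alpha >= beta:
            break
    return best


def pvs(node, depth, alpha, beta, children, evaluate):
    """First child on the full window; later children on a null window,
    with a full re-search only on a null-window fail-high."""
    kids = children(node)
    if depth == 0 or not kids:
        return evaluate(node)
    best = -10**9
    for i, child in enumerate(kids):
        if i == 0:
            score = -pvs(child, depth - 1, -beta, -alpha, children, evaluate)
        else:
            score = -pvs(child, depth - 1, -alpha - 1, -alpha,
                         children, evaluate)
            if alpha < score < beta:       # null-window fail-high
                score = -pvs(child, depth - 1, -beta, -score,
                             children, evaluate)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break
    return best
```

The speedup comes from the null-window searches cutting off almost immediately whenever move ordering puts the best move first.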
430
+ ## 2026-03-09 04:51 PST
431
+
432
+ - Spent part of the newly-won search speed on opening strength by widening selective root deepening for the very early game:
433
+ - new rule in both `src/zero960/engine/search.py` and `outputs/codex_swarm/champion_search.py`:
434
+ - if `fullmove_number <= 2` and the root has `<= 24` legal moves, search one extra ply.
435
+ - Short timing probe on Chess960 start `123` at nominal `depth=2` after this change:
436
+ - first two plies were about `0.72s` to `0.79s`,
437
+ - later plies mostly stayed below `~1.0s`,
438
+ - move choices changed from the previous PVS-only run, which is exactly the intended effect.
439
+ - This is a deliberate trade:
440
+ - use the earlier TT/PVS speed wins to buy more opening search depth,
441
+ - keep the rest of the game closer to the cheaper depth-2 profile.
442
+
443
+ ## 2026-03-09 05:00 PST
444
+
445
+ - Added null-move pruning in both `src/zero960/engine/search.py` and `outputs/codex_swarm/champion_search.py` for non-check, non-endgame nodes at depth `>= 3`.
446
+ - Null-move did not produce a clean universal speed win on the sampled 10-ply probe, but it did alter the searched lines and reduced some later plies while making the earliest opening ply somewhat heavier. Kept it in place as a standard classical pruning rule pending larger-match validation.
447
+ - Added persistent history ordering across moves in both search files so quiet-move ordering can reuse what the engine has already learned earlier in the same game.
448
+ - Timing on the same Chess960 start after the last two changes stayed in the same general operating envelope:
449
+ - opening plies roughly `0.86s` to `1.21s` under the widened early-opening extension,
450
+ - later plies mostly around `0.10s` to `0.72s`,
451
+ - still materially better than the older pre-TT / pre-PVS search stack on later same-game plies.
452
+ - Tiny one-position `Stockfish UCI_Elo=1600` anchor probes are still too slow / flaky to treat as decision-grade, so the most reliable signal from this phase remains the measured same-game search speed improvements plus the earlier larger baseline/anchor results already recorded above.
453
+
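Null-move pruning with the stated guards (non-check, non-endgame, depth >= 3) can be sketched over hypothetical callbacks, using the conventional `R = 2` reduction; `hooks` bundles stand-ins for the real board queries:

```python
def negamax_nm(hooks, pos, depth, alpha, beta):
    """Negamax with null-move pruning. `hooks` holds hypothetical
    callbacks: children, evaluate, in_check, is_endgame, null_move."""
    children = hooks["children"](pos)
    if depth <= 0 or not children:
        return hooks["evaluate"](pos)
    if depth >= 3 and not hooks["in_check"](pos) and not hooks["is_endgame"](pos):
        # Give the opponent a free move; search depth - 1 - R (R = 2) on a
        # zero-width window around beta. If we still beat beta, prune.
        score = -negamax_nm(hooks, hooks["null_move"](pos), depth - 3,
                            -beta, -beta + 1)
        if score >= beta:
            return score
    best = -10**9
    for child in children:
        score = -negamax_nm(hooks, child, depth - 1, -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break
    return best
```

The in-check and endgame guards exist because null-move is unsound there (zugzwang), which is consistent with the restriction the log entry describes.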
454
+ ## 2026-03-09 05:08 PST
455
+
456
+ - Added late move reductions (LMR) in both `src/zero960/engine/search.py` and `outputs/codex_swarm/champion_search.py`:
457
+ - later quiet moves at depth `>= 3` are searched at one reduced ply first,
458
+ - only moves that improve the window get the full re-search.
459
+ - Added aspiration windows at the root, seeded from the persistent TT score with automatic fallback to a full window on fail-low / fail-high.
460
+ - On the standard 10-ply Chess960 timing probe, these two changes kept the engine on a better branch:
461
+ - first ply dropped to about `0.86s`,
462
+ - second ply to about `0.61s`,
463
+ - later plies mostly in the `0.11s` to `0.46s` range.
464
+ - Also tried quiescence delta pruning as another leaf-speed optimization, but reverted it after it made several early plies materially worse on the same probe.
465
+ - Current kept search stack is therefore:
466
+ - alpha-beta negamax
467
+ - quiescence with in-check evasions
468
+ - TT probe/store
469
+ - persistent TT reuse across moves
470
+ - TT root move ordering
471
+ - persistent history ordering
472
+ - killer ordering
473
+ - PVS
474
+ - null-move pruning
475
+ - LMR
476
+ - aspiration windows at the root
477
+ - selective opening / tactical / endgame root extensions
478
+
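The root aspiration-window logic is small enough to sketch on its own. `search(alpha, beta)` stands in for one root search call, and the TT-seeded guess and window size are illustrative:

```python
def aspiration_search(search, tt_guess, window=50, full=(-10**9, 10**9)):
    """Try a narrow window around the TT score first; fall back to the
    full window on a fail-low or fail-high."""
    alpha, beta = tt_guess - window, tt_guess + window
    score = search(alpha, beta)
    if score <= alpha or score >= beta:   # fell outside the window
        score = search(*full)             # automatic full-window re-search
    return score
```

When the TT guess is close, the narrow window makes the first search much cheaper; when it is wrong, the cost is one extra full-window search.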
479
+ ## 2026-03-09 05:16 PST
480
+
481
+ - Tried widening the opening-depth policy from:
482
+   - `fullmove_number <= 2` / `<= 24` legal moves
483
+   to
484
+   - `fullmove_number <= 3` / `<= 22` legal moves.
485
+ - On the standard 10-ply timing probe, that pushed the first opening plies too high (`~1.40s` and `~1.05s`) without enough evidence of compensating benefit, so the change was reverted.
486
+ - Keeping the more conservative opening-depth rule that was already in place before that experiment.
docs/why_chess960.md CHANGED
@@ -1,4 +1,4 @@
1
- # why chess960
2
 
3
  ## short version
4
 
@@ -23,7 +23,7 @@ We should claim something narrower and more defensible:
23
  - strong standard-chess performance does not automatically transfer
24
  - this makes Chess960 a good downstream benchmark for a tool-using self-improvement environment
25
 
26
- ## relation to the project
27
 
28
  0x960 is not a move-prediction benchmark. The model does not play moves directly as its primary task.
29
 
 
1
+ # Why Chess960
2
 
3
  ## short version
4
 
 
23
  - strong standard-chess performance does not automatically transfer
24
  - this makes Chess960 a good downstream benchmark for a tool-using self-improvement environment
25
 
26
+ ## Relation to 0x960
27
 
28
  0x960 is not a move-prediction benchmark. The model does not play moves directly as its primary task.
29
 
media/submission/0x960_score_progression.png ADDED
media/submission/0x960_score_progression.txt ADDED
@@ -0,0 +1,4 @@
1
+ Champion score progression (all attempts)
2
+ points=9
3
+ min=0.4305
4
+ max=0.7219
media/submission/0x960_stockfish_anchors.png ADDED
media/submission/0x960_stockfish_anchors.txt ADDED
@@ -0,0 +1,3 @@
1
+ Stockfish anchor bars
2
+ anchors=2
3
+ elo range=1320-1600
media/submission/submission_summary.txt ADDED
@@ -0,0 +1,5 @@
1
+ Accepted samples:
2
+ round_20260308T063558Z_1: 0.6172 (yes)
3
+ round_20260308T070827Z_1: 0.5859 (yes)
4
+ round_20260308T091220Z_1: 0.5781 (yes)
5
+ round_20260308T111412Z_1: 0.6875 (yes)
scripts/generate_submission_media.py ADDED
@@ -0,0 +1,304 @@
1
+ #!/usr/bin/env python3
2
+ """Generate tracked PNG graphs from benchmark artifacts for submission media."""
3
+
4
+ from __future__ import annotations
5
+
6
+ import json
8
+ import struct
9
+ import zlib
10
+ from dataclasses import dataclass
11
+ from pathlib import Path
12
+
13
+
14
+ @dataclass(slots=True)
15
+ class Color:
16
+ r: int
17
+ g: int
18
+ b: int
19
+
20
+ def to_tuple(self) -> tuple[int, int, int]:
21
+ return (self.r, self.g, self.b)
22
+
23
+
24
+ WHITE = Color(245, 247, 250)
25
+ BG = Color(13, 17, 23)
26
+ AXIS = Color(132, 146, 165)
27
+ GRID = Color(44, 58, 73)
28
+ LINE = Color(88, 166, 255)
29
+ GOOD = Color(63, 185, 80)
30
+ BAD = Color(248, 81, 73)
31
+ MID = Color(210, 153, 34)
32
+ TEXT = Color(230, 237, 243)
33
+
34
+
35
+ class Canvas:
36
+ def __init__(self, width: int, height: int, bg: Color = BG) -> None:
37
+ self.width = width
38
+ self.height = height
39
+ self.pixels = [[bg.to_tuple() for _ in range(width)] for _ in range(height)]
40
+
41
+ def set_pixel(self, x: int, y: int, color: Color) -> None:
42
+ if 0 <= x < self.width and 0 <= y < self.height:
43
+ self.pixels[y][x] = color.to_tuple()
44
+
45
+ def line(self, x0: int, y0: int, x1: int, y1: int, color: Color) -> None:
46
+ dx = abs(x1 - x0)
47
+ dy = -abs(y1 - y0)
48
+ sx = 1 if x0 < x1 else -1
49
+ sy = 1 if y0 < y1 else -1
50
+ err = dx + dy
51
+ while True:
52
+ self.set_pixel(x0, y0, color)
53
+ if x0 == x1 and y0 == y1:
54
+ break
55
+ e2 = 2 * err
56
+ if e2 >= dy:
57
+ err += dy
58
+ x0 += sx
59
+ if e2 <= dx:
60
+ err += dx
61
+ y0 += sy
62
+
63
+ def rect(
64
+ self,
65
+ x0: int,
66
+ y0: int,
67
+ x1: int,
68
+ y1: int,
69
+ color: Color,
70
+ fill: bool = True,
71
+ ) -> None:
72
+ if fill:
73
+ for yy in range(max(0, y0), min(self.height, y1 + 1)):
74
+ for xx in range(max(0, x0), min(self.width, x1 + 1)):
75
+ self.pixels[yy][xx] = color.to_tuple()
76
+ else:
77
+ self.line(x0, y0, x1, y0, color)
78
+ self.line(x0, y1, x1, y1, color)
79
+ self.line(x0, y0, x0, y1, color)
80
+ self.line(x1, y0, x1, y1, color)
81
+
82
+ def circle(self, x: int, y: int, radius: int, color: Color) -> None:
83
+ for dy in range(-radius, radius + 1):
84
+ for dx in range(-radius, radius + 1):
85
+ if dx * dx + dy * dy <= radius * radius:
86
+ self.set_pixel(x + dx, y + dy, color)
87
+
88
+ def write_png(self, path: Path) -> None:
89
+ body = bytearray()
90
+ for row in self.pixels:
91
+ body.append(0)
92
+ row_bytes = bytearray()
93
+ for pixel in row:
94
+ row_bytes.extend(bytearray(pixel))
95
+ body.extend(row_bytes)
96
+ raw = zlib.compress(bytes(body), 9)
97
+
98
+ def chunk(chunk_type: bytes, data: bytes) -> bytes:
99
+ size = len(data)
100
+ head = struct.pack(">I", size) + chunk_type + data
101
+ crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
102
+ return struct.pack(">I", size) + chunk_type + data + struct.pack(">I", crc)
103
+
104
+ ihdr = struct.pack(
105
+ ">IIBBBBB",
106
+ self.width,
107
+ self.height,
108
+ 8,
109
+ 2,
110
+ 0,
111
+ 0,
112
+ 0,
113
+ )
114
+ png_data = (
115
+ b"\x89PNG\r\n\x1a\n"
116
+ + chunk(b"IHDR", ihdr)
117
+ + chunk(b"IDAT", raw)
118
+ + chunk(b"IEND", b"")
119
+ )
120
+ path.parent.mkdir(parents=True, exist_ok=True)
121
+ path.write_bytes(png_data)
122
+
123
+
124
+ def _draw_axes(chart: Canvas, left: int, right: int, top: int, bottom: int) -> None:
125
+ chart.line(left, bottom, right, bottom, AXIS)
126
+ chart.line(left, top, left, bottom, AXIS)
127
+ for i in range(5):
128
+ x = left + int((right - left) * (i / 4))
129
+ chart.line(x, top, x, bottom, GRID)
130
+ chart.line(left, top + int((bottom - top) * (i / 4)), right, top + int((bottom - top) * (i / 4)), GRID)
131
+
132
+
133
+ def _norm(value: float, lo: float, hi: float) -> float:
134
+ if hi == lo:
135
+ return 0.0
136
+ return (value - lo) / (hi - lo)
137
+
138
+
139
+ def _plot_line_chart(
140
+ out_path: Path,
141
+ points: list[tuple[str, float, bool]],
142
+ title: str,
143
+ ) -> None:
144
+ if not points:
145
+ return
146
+
147
+ width, height = 1200, 700
148
+ canvas = Canvas(width, height)
149
+ left, right = 100, width - 80
150
+ top, bottom = 120, height - 90
151
+ _draw_axes(canvas, left, right, top, bottom)
152
+
153
+ values = [p[1] for p in points]
154
+ min_v = min(values) * 0.95
155
+ max_v = max(values) * 1.05
156
+ if min_v == max_v:
157
+ min_v -= 0.1
158
+ max_v += 0.1
159
+
160
+ def point_to_xy(index: int, value: float) -> tuple[int, int]:
161
+ x = left + int((right - left) * (index / max(len(points) - 1, 1)))
162
+ y = bottom - int((_norm(value, min_v, max_v)) * (bottom - top))
163
+ return x, y
164
+
165
+ for idx in range(len(points) - 1):
166
+ x0, y0 = point_to_xy(idx, points[idx][1])
167
+ x1, y1 = point_to_xy(idx + 1, points[idx + 1][1])
168
+ color = GOOD if points[idx + 1][2] else MID
169
+ canvas.line(x0, y0, x1, y1, color)
170
+
171
+ for idx, (_, value, accepted) in enumerate(points):
172
+ x, y = point_to_xy(idx, value)
173
+ canvas.circle(x, y, 5, GOOD if accepted else BAD)
174
+
175
+ for x in range(len(points)):
176
+ px, py = point_to_xy(x, points[x][1])
177
+ canvas.line(px, py + 8, px, bottom, AXIS)
178
+ canvas.set_pixel(px, bottom + 2, TEXT)
179
+
180
+ canvas.line(left + 1, top + 20, right - 1, top + 20, GRID)
181
+ canvas.set_pixel(left + 2, top + 5, TEXT)
182
+
183
+ # Simple title marker shapes (no text, to avoid a font dependency)
184
+ canvas.rect(left + 4, 22, left + 14, 36, AXIS, fill=False)
185
+ canvas.line(right - 200, 34, right - 80, 34, AXIS)
186
+ canvas.set_pixel(right - 60, 34, TEXT)
187
+
188
+ canvas.write_png(out_path)
189
+ _write_caption(
190
+ out_path.with_suffix(".txt"),
191
+ [
192
+ title,
193
+ f"points={len(points)}",
194
+ f"min={min_v:.4f}",
195
+ f"max={max_v:.4f}",
196
+ ],
197
+ )
198
+
199
+
200
+ def _plot_anchor_bars(out_path: Path, anchors: list[dict[str, object]]) -> None:
201
+ width, height = 1200, 700
202
+ canvas = Canvas(width, height, BG)
203
+ left, right = 120, width - 80
204
+ top, bottom = 140, height - 130
205
+ _draw_axes(canvas, left, right, top, bottom)
206
+
207
+ if not anchors:
208
+ canvas.line(left + 1, bottom - 1, right - 1, top + 1, MID)
209
+ canvas.write_png(out_path)
210
+ return
211
+
212
+ bars = []
213
+ for row in anchors:
214
+ elo = float(row.get("uci_elo", 0))
215
+ score = float(row.get("score", 0.5))
216
+ bars.append((elo, score))
217
+
218
+ bar_space = (right - left) / max(len(bars), 1)
219
+ min_score = min(score for _, score in bars)
220
+ max_score = max(score for _, score in bars)
221
+ if min_score == max_score:
222
+ min_score -= 0.05
223
+ max_score += 0.05
224
+
225
+ for idx, (elo, score) in enumerate(bars):
226
+ x0 = int(left + idx * bar_space + bar_space * 0.2)
227
+ x1 = int(left + (idx + 1) * bar_space - bar_space * 0.2)
228
+ y = bottom - int(_norm(score, min_score, max_score) * (bottom - top))
229
+ canvas.rect(x0, y, x1, bottom, GOOD if score > 0.5 else BAD)
230
+ label = int(elo)
231
+ chart_pos = x0 + 6
232
+ for digit in str(label):
233
+ if chart_pos < width - 20:
234
+ chart_pos += 10
235
+
236
+ canvas.write_png(out_path)
237
+ _write_caption(
238
+ out_path.with_suffix(".txt"),
239
+ [
240
+ "Stockfish anchor bars",
241
+ f"anchors={len(anchors)}",
242
+ f"elo range={int(min(e for e, _ in bars))}-{int(max(e for e, _ in bars))}",
243
+ ],
244
+ )
245
+
246
+
247
+ def _write_caption(path: Path, lines: list[str]) -> None:
248
+ path.write_text("\n".join(lines), encoding="utf-8")
249
+
250
+
251
+ def load_dashboard_data(path: Path) -> dict:
252
+ if not path.exists():
253
+ raise FileNotFoundError(f"missing dashboard data: {path}")
254
+ return json.loads(path.read_text(encoding="utf-8"))
255
+
256
+
257
+ def main() -> None:
258
+ root = Path(__file__).resolve().parents[1]
259
+ data = load_dashboard_data(root / "outputs" / "dashboard" / "dashboard_data.json")
260
+ media_root = root / "media" / "submission"
261
+ media_root.mkdir(parents=True, exist_ok=True)
262
+
263
+ accepted = [
264
+ (row.get("round_name", f"#{idx}"), float(row.get("score", 0.5)), bool(row.get("accepted", False)))
265
+ for idx, row in enumerate(data.get("accepted_results", []))
266
+ ]
267
+ all_results = [
268
+ (row.get("round_name", f"#{idx}"), float(row.get("score", 0.5)), bool(row.get("accepted", False)))
269
+ for idx, row in enumerate(data.get("all_results", []))
270
+ ]
271
+
272
+ if all_results:
273
+ _plot_line_chart(
274
+ media_root / "0x960_score_progression.png",
275
+ all_results,
276
+ "Champion score progression (all attempts)",
277
+ )
278
+ else:
279
+ _plot_line_chart(
280
+ media_root / "0x960_score_progression.png",
281
+ [("n/a", 0.5, True)],
282
+ "Champion score progression (empty)",
283
+ )
284
+
285
+ _plot_anchor_bars(
286
+ media_root / "0x960_stockfish_anchors.png",
287
+ data.get("stockfish_anchors", []),
288
+ )
289
+
290
+ if accepted:
291
+ _write_caption(
292
+ media_root / "submission_summary.txt",
293
+ [
294
+ "Accepted samples:",
295
+ *(
296
+ f"{round_name}: {score:.4f} ({'yes' if accepted else 'no'})"
297
+ for round_name, score, accepted in accepted
298
+ ),
299
+ ],
300
+ )
301
+
302
+
303
+ if __name__ == "__main__":
304
+ main()
src/zero960/engine/default_eval.py CHANGED
@@ -5,33 +5,194 @@ import chess
5
  PIECE_VALUES = {
6
  chess.PAWN: 100,
7
  chess.KNIGHT: 320,
8
- chess.BISHOP: 330,
9
  chess.ROOK: 500,
10
  chess.QUEEN: 900,
11
  chess.KING: 0,
12
  }
13
 
14
- CENTER_SQUARES = {chess.D4, chess.E4, chess.D5, chess.E5}
15
 
16
 
17
- def evaluate(board: chess.Board) -> int:
18
- """Return a simple white-centric score in centipawns."""
19
- if board.is_checkmate():
20
- return -100_000 if board.turn == chess.WHITE else 100_000
21
- if board.is_stalemate() or board.is_insufficient_material():
22
  return 0
23

24
  score = 0
25
  for piece_type, piece_value in PIECE_VALUES.items():
26
- score += piece_value * len(board.pieces(piece_type, chess.WHITE))
27
- score -= piece_value * len(board.pieces(piece_type, chess.BLACK))
28
 
29
- for square in CENTER_SQUARES:
30
- piece = board.piece_at(square)
31
- if piece is None:
32
- continue
33
- score += 15 if piece.color == chess.WHITE else -15
34
 
35
- score += 2 * board.legal_moves.count() if board.turn == chess.WHITE else -2 * board.legal_moves.count()
 
 
 
 
 
36
  return score
37
 
 
 
 
 
 
 
 
 
 
 
 
 
 
5
  PIECE_VALUES = {
6
  chess.PAWN: 100,
7
  chess.KNIGHT: 320,
8
+ chess.BISHOP: 335,
9
  chess.ROOK: 500,
10
  chess.QUEEN: 900,
11
  chess.KING: 0,
12
  }
13
 
14
+ CENTER_SQUARES = (chess.D4, chess.E4, chess.D5, chess.E5)
15
+ EXTENDED_CENTER = (
16
+ chess.C3, chess.D3, chess.E3, chess.F3,
17
+ chess.C4, chess.D4, chess.E4, chess.F4,
18
+ chess.C5, chess.D5, chess.E5, chess.F5,
19
+ chess.C6, chess.D6, chess.E6, chess.F6,
20
+ )
21
+ PIECE_MOBILITY_WEIGHTS = {
22
+ chess.KNIGHT: 4,
23
+ chess.BISHOP: 5,
24
+ chess.ROOK: 3,
25
+ chess.QUEEN: 2,
26
+ }
27
+ BISHOP_PAIR_BONUS = 35
28
+ ROOK_OPEN_FILE_BONUS = 20
29
+ ROOK_SEMIOPEN_FILE_BONUS = 10
30
+ DOUBLED_PAWN_PENALTY = 18
31
+ ISOLATED_PAWN_PENALTY = 14
32
+ BACK_RANK_MINOR_PENALTY = 10
33
+ CENTER_OCCUPANCY_BONUS = 14
34
+ CENTER_ATTACK_BONUS = 3
35
+ CASTLING_RIGHTS_BONUS = 12
36
+ TEMPO_BONUS = 8
37
+ PASSED_PAWN_BONUS_BY_RANK = [0, 5, 10, 18, 28, 42, 60, 0]
38
 
39
 
40
+ def _phase(board: chess.Board) -> int:
41
+ phase = 0
42
+ phase += 4 * (len(board.pieces(chess.QUEEN, chess.WHITE)) + len(board.pieces(chess.QUEEN, chess.BLACK)))
43
+ phase += 2 * (len(board.pieces(chess.ROOK, chess.WHITE)) + len(board.pieces(chess.ROOK, chess.BLACK)))
44
+ phase += len(board.pieces(chess.BISHOP, chess.WHITE)) + len(board.pieces(chess.BISHOP, chess.BLACK))
45
+ phase += len(board.pieces(chess.KNIGHT, chess.WHITE)) + len(board.pieces(chess.KNIGHT, chess.BLACK))
46
+ return min(phase, 24)
47
+
48
+
49
+ def _friendly(square: int, color: chess.Color, board: chess.Board) -> bool:
50
+ return board.color_at(square) == color
51
+
52
+
53
+ def _file_pawn_counts(board: chess.Board, color: chess.Color) -> list[int]:
54
+ counts = [0] * 8
55
+ for square in board.pieces(chess.PAWN, color):
56
+ counts[chess.square_file(square)] += 1
57
+ return counts
58
+
59
+
60
+ def _pawn_structure_score(board: chess.Board, color: chess.Color) -> int:
61
+ score = 0
62
+ pawns = sorted(board.pieces(chess.PAWN, color))
63
+ enemy_pawns = list(board.pieces(chess.PAWN, not color))
64
+ file_counts = _file_pawn_counts(board, color)
65
+
66
+ for count in file_counts:
67
+ if count > 1:
68
+ score -= DOUBLED_PAWN_PENALTY * (count - 1)
69
+
70
+ for square in pawns:
71
+ file_index = chess.square_file(square)
72
+ left_count = file_counts[file_index - 1] if file_index > 0 else 0
73
+ right_count = file_counts[file_index + 1] if file_index < 7 else 0
74
+ if left_count == 0 and right_count == 0:
75
+ score -= ISOLATED_PAWN_PENALTY
76
+
77
+ rank_index = chess.square_rank(square)
78
+ blocked = False
79
+ for enemy_square in enemy_pawns:
80
+ enemy_file = chess.square_file(enemy_square)
81
+ if abs(enemy_file - file_index) > 1:
82
+ continue
83
+ enemy_rank = chess.square_rank(enemy_square)
84
+ if color == chess.WHITE and enemy_rank > rank_index:
85
+ blocked = True
86
+ break
87
+ if color == chess.BLACK and enemy_rank < rank_index:
88
+ blocked = True
89
+ break
90
+ if not blocked:
91
+ advance = rank_index if color == chess.WHITE else 7 - rank_index
92
+ score += PASSED_PAWN_BONUS_BY_RANK[advance]
93
+
94
+ return score
95
+
96
+
97
+ def _mobility_score(board: chess.Board, color: chess.Color) -> int:
98
+ score = 0
99
+ friendly_mask = board.occupied_co[color]
100
+ for piece_type, weight in PIECE_MOBILITY_WEIGHTS.items():
101
+ for square in board.pieces(piece_type, color):
102
+ attacks = board.attacks_mask(square) & ~friendly_mask
103
+ score += weight * chess.popcount(attacks)
104
+ return score
105
+
106
+
107
+ def _center_score(board: chess.Board, color: chess.Color) -> int:
108
+ score = 0
109
+ for square in CENTER_SQUARES:
110
+ if _friendly(square, color, board):
111
+ score += CENTER_OCCUPANCY_BONUS
112
+
113
+ for square in EXTENDED_CENTER:
114
+ score += CENTER_ATTACK_BONUS * chess.popcount(board.attackers_mask(color, square))
115
+ return score
116
+
117
+
118
+ def _rook_file_score(board: chess.Board, color: chess.Color) -> int:
119
+ score = 0
120
+ friendly_pawns = board.pieces(chess.PAWN, color)
121
+ enemy_pawns = board.pieces(chess.PAWN, not color)
122
+ for square in board.pieces(chess.ROOK, color):
123
+ file_index = chess.square_file(square)
124
+ friendly_on_file = any(chess.square_file(pawn_square) == file_index for pawn_square in friendly_pawns)
125
+ enemy_on_file = any(chess.square_file(pawn_square) == file_index for pawn_square in enemy_pawns)
126
+ if not friendly_on_file:
127
+ score += ROOK_SEMIOPEN_FILE_BONUS
128
+ if not enemy_on_file:
129
+ score += ROOK_OPEN_FILE_BONUS
130
+ return score
131
+
132
+
133
+ def _king_safety_score(board: chess.Board, color: chess.Color, phase: int) -> int:
134
+ king_square = board.king(color)
135
+ if king_square is None:
136
+ return 0
137
+
138
+ score = 0
139
+ king_file = chess.square_file(king_square)
140
+ king_rank = chess.square_rank(king_square)
141
+
142
+ for file_index in range(max(0, king_file - 1), min(7, king_file + 1) + 1):
143
+ shelter_ranks = [king_rank + 1, king_rank + 2] if color == chess.WHITE else [king_rank - 1, king_rank - 2]
144
+ for rank_index in shelter_ranks:
145
+ if 0 <= rank_index < 8 and _friendly(chess.square(file_index, rank_index), color, board):
146
+ score += 4
147
+
148
+ enemy_pressure = 0
149
+ for square in chess.SquareSet(chess.BB_KING_ATTACKS[king_square]):
150
+ enemy_pressure += chess.popcount(board.attackers_mask(not color, square))
151
+ score -= enemy_pressure * (2 + phase // 8)
152
+
153
+ if board.has_castling_rights(color):
154
+ score += CASTLING_RIGHTS_BONUS * phase // 24
155
+ return score
156
+
157
+
158
+ def _development_score(board: chess.Board, color: chess.Color, phase: int) -> int:
159
+ if phase <= 8:
160
  return 0
161
 
162
+ home_rank = 0 if color == chess.WHITE else 7
163
+ penalty = 0
164
+ for piece_type in (chess.KNIGHT, chess.BISHOP):
165
+ for square in board.pieces(piece_type, color):
166
+ if chess.square_rank(square) == home_rank:
167
+ penalty += BACK_RANK_MINOR_PENALTY
168
+ return -penalty
169
+
170
+
171
+ def _side_score(board: chess.Board, color: chess.Color, phase: int) -> int:
172
  score = 0
173
  for piece_type, piece_value in PIECE_VALUES.items():
174
+ score += piece_value * len(board.pieces(piece_type, color))
 
175
 
176
+ if len(board.pieces(chess.BISHOP, color)) >= 2:
177
+ score += BISHOP_PAIR_BONUS
 
 
 
178
 
179
+ score += _pawn_structure_score(board, color)
180
+ score += _mobility_score(board, color)
181
+ score += _center_score(board, color)
182
+ score += _rook_file_score(board, color)
183
+ score += _king_safety_score(board, color, phase)
184
+ score += _development_score(board, color, phase)
185
  return score
186
 
187
+
188
+ def evaluate(board: chess.Board) -> int:
189
+ """Return a Chess960-safe white-centric score in centipawns."""
190
+ if board.is_checkmate():
191
+ return -100_000 if board.turn == chess.WHITE else 100_000
192
+ if board.is_stalemate() or board.is_insufficient_material():
193
+ return 0
194
+
195
+ phase = _phase(board)
196
+ score = _side_score(board, chess.WHITE, phase) - _side_score(board, chess.BLACK, phase)
197
+ score += TEMPO_BONUS if board.turn == chess.WHITE else -TEMPO_BONUS
198
+ return score
src/zero960/engine/search.py CHANGED
@@ -1,11 +1,45 @@
 from __future__ import annotations
 
 from collections.abc import Callable
+from typing import NamedTuple
 
 import chess
 
 EvalFn = Callable[[chess.Board], int]
 MATE_SCORE = 100_000
+TT_EXACT = "exact"
+TT_LOWER = "lower"
+TT_UPPER = "upper"
+MAX_TT_ENTRIES = 50_000
+CAPTURE_ORDER = {
+    chess.PAWN: 1,
+    chess.KNIGHT: 3,
+    chess.BISHOP: 3,
+    chess.ROOK: 5,
+    chess.QUEEN: 9,
+    chess.KING: 0,
+}
+ENDGAME_PHASE_THRESHOLD = 6
+LOW_BRANCHING_THRESHOLD = 12
+ENDGAME_BRANCHING_THRESHOLD = 18
+OPENING_FULLMOVE_LIMIT = 2
+OPENING_BRANCHING_THRESHOLD = 24
+NULL_MOVE_DEPTH_REDUCTION = 2
+NULL_MOVE_MIN_DEPTH = 3
+LMR_MIN_DEPTH = 3
+LMR_MIN_MOVE_INDEX = 3
+ASPIRATION_WINDOW = 60
+
+
+class TTEntry(NamedTuple):
+    depth: int
+    score: int
+    bound: str
+    best_move: chess.Move | None
+
+
+_GLOBAL_TT: dict[tuple[object, ...], TTEntry] = {}
+_GLOBAL_HISTORY: dict[tuple[int, int], int] = {}
 
 
 def _terminal_score(board: chess.Board) -> int:
@@ -14,30 +48,232 @@ def _terminal_score(board: chess.Board) -> int:
     return 0
 
 
+def _phase(board: chess.Board) -> int:
+    phase = 0
+    phase += 4 * (len(board.pieces(chess.QUEEN, chess.WHITE)) + len(board.pieces(chess.QUEEN, chess.BLACK)))
+    phase += 2 * (len(board.pieces(chess.ROOK, chess.WHITE)) + len(board.pieces(chess.ROOK, chess.BLACK)))
+    phase += len(board.pieces(chess.BISHOP, chess.WHITE)) + len(board.pieces(chess.BISHOP, chess.BLACK))
+    phase += len(board.pieces(chess.KNIGHT, chess.WHITE)) + len(board.pieces(chess.KNIGHT, chess.BLACK))
+    return min(phase, 24)
+
+
+def _selective_root_depth(board: chess.Board, depth: int, move_count: int) -> int:
+    if depth < 2:
+        return depth
+    if board.fullmove_number <= OPENING_FULLMOVE_LIMIT and move_count <= OPENING_BRANCHING_THRESHOLD:
+        return depth + 1
+    if board.is_check() or move_count <= LOW_BRANCHING_THRESHOLD:
+        return depth + 1
+    if _phase(board) <= ENDGAME_PHASE_THRESHOLD and move_count <= ENDGAME_BRANCHING_THRESHOLD:
+        return depth + 1
+    return depth
+
+
 def _score_for_turn(board: chess.Board, eval_fn: EvalFn) -> int:
     score = eval_fn(board)
     return score if board.turn == chess.WHITE else -score
 
 
+def _move_order_score(
+    board: chess.Board,
+    move: chess.Move,
+    *,
+    tt_move: chess.Move | None = None,
+    killer_moves: tuple[chess.Move, ...] = (),
+    history: dict[tuple[int, int], int] | None = None,
+) -> int:
+    if tt_move is not None and move == tt_move:
+        return 1_000_000
+
+    score = 0
+    if board.is_capture(move):
+        victim = board.piece_at(move.to_square)
+        attacker = board.piece_at(move.from_square)
+        if victim is not None:
+            score += 100 * CAPTURE_ORDER[victim.piece_type]
+        if attacker is not None:
+            score -= 10 * CAPTURE_ORDER[attacker.piece_type]
+    if move.promotion is not None:
+        score += 800 + CAPTURE_ORDER.get(move.promotion, 0)
+    if board.gives_check(move):
+        score += 50
+    if board.is_castling(move):
+        score += 25
+    if not board.is_capture(move) and move.promotion is None:
+        for index, killer in enumerate(killer_moves):
+            if move == killer:
+                score += 90_000 - index * 10_000
+                break
+        if history is not None:
+            piece_type = board.piece_type_at(move.from_square)
+            if piece_type is not None:
+                score += history.get((piece_type, move.to_square), 0)
+    return score
+
+
+def _ordered_moves(
+    board: chess.Board,
+    *,
+    tt_move: chess.Move | None = None,
+    killer_moves: tuple[chess.Move, ...] = (),
+    history: dict[tuple[int, int], int] | None = None,
+) -> list[chess.Move]:
+    return sorted(
+        board.legal_moves,
+        key=lambda move: _move_order_score(
+            board,
+            move,
+            tt_move=tt_move,
+            killer_moves=killer_moves,
+            history=history,
+        ),
+        reverse=True,
+    )
+
+
+def _tactical_moves(board: chess.Board) -> list[chess.Move]:
+    return [
+        move
+        for move in _ordered_moves(board)
+        if board.is_capture(move) or move.promotion is not None
+    ]
+
+
+def _record_killer(killers: dict[int, tuple[chess.Move, ...]], ply: int, move: chess.Move) -> None:
+    existing = tuple(candidate for candidate in killers.get(ply, ()) if candidate != move)
+    killers[ply] = (move, *existing[:1])
+
+
+def _record_history(
+    history: dict[tuple[int, int], int],
+    board: chess.Board,
+    move: chess.Move,
+    depth: int,
+) -> None:
+    piece_type = board.piece_type_at(move.from_square)
+    if piece_type is None:
+        return
+    key = (piece_type, move.to_square)
+    history[key] = history.get(key, 0) + depth * depth
+
+
+def _quiescence(board: chess.Board, alpha: int, beta: int, eval_fn: EvalFn) -> int:
+    if board.is_game_over(claim_draw=True):
+        return _terminal_score(board)
+
+    in_check = board.is_check()
+    if not in_check:
+        stand_pat = _score_for_turn(board, eval_fn)
+        if stand_pat >= beta:
+            return stand_pat
+        if stand_pat > alpha:
+            alpha = stand_pat
+
+    moves = _ordered_moves(board) if in_check else _tactical_moves(board)
+    for move in moves:
+        board.push(move)
+        score = -_quiescence(board, -beta, -alpha, eval_fn)
+        board.pop()
+        if score >= beta:
+            return score
+        if score > alpha:
+            alpha = score
+    return alpha
+
+
-def negamax(board: chess.Board, depth: int, alpha: int, beta: int, eval_fn: EvalFn) -> int:
-    if depth == 0 or board.is_game_over(claim_draw=True):
-        if board.is_game_over(claim_draw=True):
-            return _terminal_score(board)
-        return _score_for_turn(board, eval_fn)
+def negamax(
+    board: chess.Board,
+    depth: int,
+    alpha: int,
+    beta: int,
+    eval_fn: EvalFn,
+    tt: dict[tuple[object, ...], TTEntry],
+    killers: dict[int, tuple[chess.Move, ...]],
+    history: dict[tuple[int, int], int],
+    ply: int = 0,
+) -> int:
+    if board.is_game_over(claim_draw=True):
+        return _terminal_score(board)
+    if depth == 0:
+        return _quiescence(board, alpha, beta, eval_fn)
+
+    alpha_orig = alpha
+    key = board._transposition_key()
+    entry = tt.get(key)
+    tt_move = entry.best_move if entry is not None else None
+    if entry is not None and entry.depth >= depth:
+        if entry.bound == TT_EXACT:
+            return entry.score
+        if entry.bound == TT_LOWER:
+            alpha = max(alpha, entry.score)
+        elif entry.bound == TT_UPPER:
+            beta = min(beta, entry.score)
+        if alpha >= beta:
+            return entry.score
+
+    if (
+        depth >= NULL_MOVE_MIN_DEPTH
+        and not board.is_check()
+        and _phase(board) > ENDGAME_PHASE_THRESHOLD
+        and beta < MATE_SCORE
+    ):
+        board.push(chess.Move.null())
+        null_score = -negamax(
+            board,
+            depth - 1 - NULL_MOVE_DEPTH_REDUCTION,
+            -beta,
+            -beta + 1,
+            eval_fn,
+            tt,
+            killers,
+            history,
+            ply + 1,
+        )
+        board.pop()
+        if null_score >= beta:
+            return beta
 
     best_score = -MATE_SCORE
-    for move in board.legal_moves:
+    best_move: chess.Move | None = None
+    killer_moves = killers.get(ply, ())
+    for move_index, move in enumerate(_ordered_moves(board, tt_move=tt_move, killer_moves=killer_moves, history=history)):
         board.push(move)
-        score = -negamax(board, depth - 1, -beta, -alpha, eval_fn)
+        if move_index == 0:
+            score = -negamax(board, depth - 1, -beta, -alpha, eval_fn, tt, killers, history, ply + 1)
+        else:
+            reduced_depth = depth - 1
+            if (
+                depth >= LMR_MIN_DEPTH
+                and move_index >= LMR_MIN_MOVE_INDEX
+                and not board.is_check()
+                and not board.is_capture(move)
+                and move.promotion is None
+            ):
+                reduced_depth -= 1
+            score = -negamax(board, reduced_depth, -alpha - 1, -alpha, eval_fn, tt, killers, history, ply + 1)
+            if alpha < score < beta:
+                score = -negamax(board, depth - 1, -beta, -alpha, eval_fn, tt, killers, history, ply + 1)
         board.pop()
 
         if score > best_score:
             best_score = score
+            best_move = move
         if best_score > alpha:
             alpha = best_score
         if alpha >= beta:
+            if not board.is_capture(move) and move.promotion is None:
+                _record_killer(killers, ply, move)
+                _record_history(history, board, move, depth)
             break
 
+    bound = TT_EXACT
+    if best_score <= alpha_orig:
+        bound = TT_UPPER
+    elif best_score >= beta:
+        bound = TT_LOWER
+    if len(tt) >= MAX_TT_ENTRIES:
+        tt.clear()
+    tt[key] = TTEntry(depth=depth, score=best_score, bound=bound, best_move=best_move)
     return best_score
 
 
@@ -46,10 +282,61 @@ def select_move(board: chess.Board, depth: int, eval_fn: EvalFn) -> chess.Move:
     best_score = -MATE_SCORE
     alpha = -MATE_SCORE
     beta = MATE_SCORE
+    killers: dict[int, tuple[chess.Move, ...]] = {}
+    root_entry = _GLOBAL_TT.get(board._transposition_key())
+    if root_entry is not None and abs(root_entry.score) < MATE_SCORE // 2:
+        alpha = max(-MATE_SCORE, root_entry.score - ASPIRATION_WINDOW)
+        beta = min(MATE_SCORE, root_entry.score + ASPIRATION_WINDOW)
+    root_moves = _ordered_moves(
+        board,
+        tt_move=root_entry.best_move if root_entry is not None else None,
+        history=_GLOBAL_HISTORY,
+    )
+    search_depth = _selective_root_depth(board, depth, len(root_moves))
+    use_full_window = False
 
-    for move in board.legal_moves:
+    for move_index, move in enumerate(root_moves):
         board.push(move)
-        score = -negamax(board, depth - 1, -beta, -alpha, eval_fn)
+        if move_index == 0:
+            score = -negamax(board, search_depth - 1, -beta, -alpha, eval_fn, _GLOBAL_TT, killers, _GLOBAL_HISTORY, 1)
+        else:
+            reduced_depth = search_depth - 1
+            if (
+                search_depth >= LMR_MIN_DEPTH
+                and move_index >= LMR_MIN_MOVE_INDEX
+                and not board.is_check()
+                and not board.is_capture(move)
+                and move.promotion is None
+            ):
+                reduced_depth -= 1
+            score = -negamax(
+                board,
+                reduced_depth,
+                -alpha - 1,
+                -alpha,
+                eval_fn,
+                _GLOBAL_TT,
+                killers,
+                _GLOBAL_HISTORY,
+                1,
+            )
+            if alpha < score < beta:
+                score = -negamax(board, search_depth - 1, -beta, -alpha, eval_fn, _GLOBAL_TT, killers, _GLOBAL_HISTORY, 1)
+        if not use_full_window and (score <= alpha or score >= beta):
+            score = -negamax(
+                board,
+                search_depth - 1,
+                -MATE_SCORE,
+                MATE_SCORE,
+                eval_fn,
+                _GLOBAL_TT,
+                killers,
+                _GLOBAL_HISTORY,
+                1,
+            )
+            alpha = -MATE_SCORE
+            beta = MATE_SCORE
+            use_full_window = True
         board.pop()
 
         if best_move is None or score > best_score:
@@ -61,4 +348,3 @@ def select_move(board: chess.Board, depth: int, eval_fn: EvalFn) -> chess.Move:
     if best_move is None:
         raise RuntimeError("no legal move available")
     return best_move
-
src/zero960/runtime/episode.py CHANGED
@@ -14,9 +14,21 @@ from zero960.runtime.workspace import WorkspaceManager
 @dataclass(slots=True)
 class EpisodeConfig:
     max_steps: int = 6
-    search_depth: int = 2
-    training_games: int = 2
+    search_depth: int = 1
+    training_games: int = 1
     crash_penalty: float = 0.25
+    valid_write_bonus: float = 0.20
+    changed_write_bonus: float = 0.10
+    unchanged_write_penalty: float = 0.10
+    explicit_match_bonus: float = 0.15
+    finish_after_match_bonus: float = 0.05
+    repeated_static_eval_penalty: float = 0.15
+    static_eval_before_write_penalty: float = 0.20
+    redundant_read_penalty: float = 0.25
+    match_without_edit_penalty: float = 0.15
+    finish_without_edit_penalty: float = 0.45
+    finish_without_match_penalty: float = 0.20
+    finish_without_retest_penalty: float = 0.08
 
 
 class Zero960EpisodeRuntime:
@@ -28,6 +40,11 @@ class Zero960EpisodeRuntime:
         self.steps_taken = 0
         self.invalid_edit_count = 0
         self.last_match_score: float | None = None
+        self.has_valid_edit = False
+        self.has_run_match = False
+        self.wrote_since_match = False
+        self.shaping_reward_total = 0.0
+        self.last_action_type: str | None = None
 
     def reset(self, chess960_index: int | None = None) -> RuntimeObservation:
         self.close()
@@ -37,6 +54,11 @@ class Zero960EpisodeRuntime:
         self.steps_taken = 0
         self.invalid_edit_count = 0
         self.last_match_score = None
+        self.has_valid_edit = False
+        self.has_run_match = False
+        self.wrote_since_match = False
+        self.shaping_reward_total = 0.0
+        self.last_action_type = None
         return self._observation("episode reset")
 
     def close(self) -> None:
@@ -50,6 +72,7 @@ class Zero960EpisodeRuntime:
 
         done = False
         reward: float | None = None
+        step_reward = 0.0
        status_message = ""
        info: dict[str, object] = {}
 
@@ -59,11 +82,29 @@ class Zero960EpisodeRuntime:
                raise ValueError("read_file requires path")
            content = self.workspace.read_file(action.path)
            status_message = f"read {action.path} ({len(content)} bytes)"
+            if action.path == "eval.py":
+                step_reward -= self.config.redundant_read_penalty
+                status_message += "; eval.py was already visible"
        elif action.action_type == "write_file":
            if action.path is None or action.content is None:
                raise ValueError("write_file requires path and content")
+            previous_content = self.workspace.read_file(action.path)
            self.workspace.write_file(action.path, action.content)
-            status_message = f"wrote {action.path}"
+            try:
+                self.workspace.load_eval_function()
+            except Exception:
+                self.workspace.write_file(action.path, previous_content)
+                raise
+
+            if action.content == previous_content:
+                step_reward -= self.config.unchanged_write_penalty
+                status_message = f"wrote {action.path}; file unchanged"
+            else:
+                step_reward += self.config.valid_write_bonus + self.config.changed_write_bonus
+                self.has_valid_edit = True
+                self.wrote_since_match = True
+                status_message = f"wrote {action.path}; validated evaluate(board)"
+                info["code_changed"] = True
        elif action.action_type == "run_static_eval":
            eval_fn = self.workspace.load_eval_function()
            board = chess.Board.from_chess960_pos(self.start_position)
@@ -71,10 +112,20 @@ class Zero960EpisodeRuntime:
            score = eval_fn(board)
            status_message = f"static eval score={score}"
            info["static_eval_score"] = score
+            if not self.has_valid_edit:
+                step_reward -= self.config.static_eval_before_write_penalty
+            if self.last_action_type == "run_static_eval":
+                step_reward -= self.config.repeated_static_eval_penalty
        elif action.action_type == "run_match":
            self.last_match_score = self._run_training_match()
+            self.has_run_match = True
            status_message = f"match score={self.last_match_score:.3f}"
            info["match_score"] = self.last_match_score
+            if self.has_valid_edit and self.wrote_since_match:
+                step_reward += self.config.explicit_match_bonus
+                self.wrote_since_match = False
+            elif not self.has_valid_edit:
+                step_reward -= self.config.match_without_edit_penalty
        elif action.action_type == "finish":
            reward = self._final_reward()
            done = True
@@ -86,8 +137,13 @@ class Zero960EpisodeRuntime:
            status_message = f"action failed: {exc}"
            info["error"] = str(exc)
 
+        if not done:
+            reward = step_reward
+            self.shaping_reward_total += step_reward
+
        self.history.append(f"{action.action_type}: {status_message}")
        self.steps_taken += 1
+        self.last_action_type = action.action_type
 
        if not done and self.steps_taken >= self.config.max_steps:
            reward = self._final_reward()
@@ -113,8 +169,17 @@ class Zero960EpisodeRuntime:
    def _final_reward(self) -> float:
        if self.last_match_score is None:
            self.last_match_score = self._run_training_match()
+        reward = self.last_match_score + self.shaping_reward_total
+        if self.has_run_match:
+            reward += self.config.finish_after_match_bonus
+        if not self.has_valid_edit:
+            reward -= self.config.finish_without_edit_penalty
+        if not self.has_run_match:
+            reward -= self.config.finish_without_match_penalty
+        if self.wrote_since_match:
+            reward -= self.config.finish_without_retest_penalty
        penalty = self.invalid_edit_count * self.config.crash_penalty
-        return self.last_match_score - penalty
+        return reward - penalty
 
    def _observation(
        self,
@@ -124,10 +189,12 @@ class Zero960EpisodeRuntime:
    ) -> RuntimeObservation:
        if self.workspace is None:
            raise RuntimeError("workspace unavailable")
+        workflow_hint, suggested_actions = self._workflow_state()
        return RuntimeObservation(
            task=(
                "Improve eval.py for the current Chess960 engine. "
-                "Use bounded file edits and finish when ready for scoring."
+                "The full file is already visible below. Best loop: write_file a valid replacement, "
+                "run_match to test it, then finish. Repeated run_static_eval and early finish are penalized."
            ),
            status_message=status_message,
            file_contents={"eval.py": self.workspace.read_file("eval.py")},
@@ -136,7 +203,32 @@ class Zero960EpisodeRuntime:
            remaining_steps=max(self.config.max_steps - self.steps_taken, 0),
            last_match_score=self.last_match_score,
            invalid_edit_count=self.invalid_edit_count,
+            workflow_hint=workflow_hint,
+            suggested_actions=suggested_actions,
+            has_valid_edit=self.has_valid_edit,
+            has_run_match=self.has_run_match,
            reward=reward,
            done=done,
        )
 
+    def _workflow_state(self) -> tuple[str, list[str]]:
+        if not self.has_valid_edit:
+            return (
+                "eval.py is already shown below. Do not waste a turn on read_file. "
+                "Write a full valid replacement for eval.py next.",
+                ["write_file", "run_match", "finish"],
+            )
+        if self.wrote_since_match:
+            return (
+                "You have a valid untested edit. Run run_match next to measure it.",
+                ["run_match", "write_file", "finish"],
+            )
+        if self.has_run_match:
+            return (
+                "You have a tested edit. Finish if the score is acceptable, otherwise write_file again.",
+                ["finish", "write_file", "run_match"],
+            )
+        return (
+            "A valid edit exists but no explicit match has been run yet. Run run_match next.",
+            ["run_match", "finish", "write_file"],
+        )
src/zero960/runtime/types.py CHANGED
@@ -23,6 +23,10 @@ class RuntimeObservation:
     remaining_steps: int
     last_match_score: float | None
     invalid_edit_count: int
+    workflow_hint: str
+    suggested_actions: list[str]
+    has_valid_edit: bool
+    has_run_match: bool
     reward: float | None = None
     done: bool = False
 
@@ -33,4 +37,3 @@ class RuntimeStepResult:
     reward: float | None
     done: bool
     info: dict[str, Any] = field(default_factory=dict)
-
src/zero960/workspace_template/eval.py CHANGED
@@ -5,20 +5,385 @@ import chess
 PIECE_VALUES = {
     chess.PAWN: 100,
     chess.KNIGHT: 320,
-    chess.BISHOP: 330,
+    chess.BISHOP: 335,
     chess.ROOK: 500,
     chess.QUEEN: 900,
     chess.KING: 0,
 }
 
+CENTER_SQUARES = (chess.D4, chess.E4, chess.D5, chess.E5)
+EXTENDED_CENTER = (
+    chess.C3, chess.D3, chess.E3, chess.F3,
+    chess.C4, chess.D4, chess.E4, chess.F4,
+    chess.C5, chess.D5, chess.E5, chess.F5,
+    chess.C6, chess.D6, chess.E6, chess.F6,
+)
+PIECE_MOBILITY_WEIGHTS = {
+    chess.KNIGHT: 4,
+    chess.BISHOP: 5,
+    chess.ROOK: 3,
+    chess.QUEEN: 2,
+}
+CENTER_AXIS_BONUS = (0, 1, 2, 3, 3, 2, 1, 0)
+BISHOP_PAIR_BONUS = 35
+ROOK_OPEN_FILE_BONUS = 20
+ROOK_SEMIOPEN_FILE_BONUS = 10
+DOUBLED_PAWN_PENALTY = 18
+ISOLATED_PAWN_PENALTY = 14
+BACK_RANK_MINOR_PENALTY = 10
+CENTER_OCCUPANCY_BONUS = 14
+CENTER_ATTACK_BONUS = 3
+CASTLING_RIGHTS_BONUS = 12
+CASTLED_BONUS = 18
+KNIGHT_CENTER_BONUS = 6
+BISHOP_CENTER_BONUS = 2
+KING_ENDGAME_CENTER_BONUS = 5
+UNDEFENDED_TARGET_DIVISOR = 16
+OVERLOADED_TARGET_DIVISOR = 24
+TEMPO_BONUS = 8
+PASSED_PAWN_BONUS_BY_RANK = [0, 5, 10, 18, 28, 42, 60, 0]
+LOOSE_PIECE_DIVISOR = 24
+OUTNUMBERED_PIECE_DIVISOR = 40
+PAWN_HARASSMENT_PENALTY = 8
+CONNECTED_PAWN_BONUS = 4
+PAWN_CHAIN_BONUS = 5
+CENTRAL_PAWN_DUO_BONUS = 10
+ADVANCED_CENTRAL_PAWN_BONUS = 3
+
+
+def _phase(board: chess.Board) -> int:
+    phase = 0
+    phase += 4 * (len(board.pieces(chess.QUEEN, chess.WHITE)) + len(board.pieces(chess.QUEEN, chess.BLACK)))
+    phase += 2 * (len(board.pieces(chess.ROOK, chess.WHITE)) + len(board.pieces(chess.ROOK, chess.BLACK)))
+    phase += len(board.pieces(chess.BISHOP, chess.WHITE)) + len(board.pieces(chess.BISHOP, chess.BLACK))
+    phase += len(board.pieces(chess.KNIGHT, chess.WHITE)) + len(board.pieces(chess.KNIGHT, chess.BLACK))
+    return min(phase, 24)
+
+
+def _friendly(square: int, color: chess.Color, board: chess.Board) -> bool:
+    return board.color_at(square) == color
+
+
+def _center_axis_score(square: int) -> int:
+    return CENTER_AXIS_BONUS[chess.square_file(square)] + CENTER_AXIS_BONUS[chess.square_rank(square)]
+
+
+def _file_pawn_counts(board: chess.Board, color: chess.Color) -> list[int]:
+    counts = [0] * 8
+    for square in board.pieces(chess.PAWN, color):
+        counts[chess.square_file(square)] += 1
+    return counts
+
+
+def _pawn_structure_score(board: chess.Board, color: chess.Color) -> int:
+    score = 0
+    pawns = sorted(board.pieces(chess.PAWN, color))
+    enemy_pawns = list(board.pieces(chess.PAWN, not color))
+    file_counts = _file_pawn_counts(board, color)
+
+    for count in file_counts:
+        if count > 1:
+            score -= DOUBLED_PAWN_PENALTY * (count - 1)
+
+    for square in pawns:
+        file_index = chess.square_file(square)
+        left_count = file_counts[file_index - 1] if file_index > 0 else 0
+        right_count = file_counts[file_index + 1] if file_index < 7 else 0
+        if left_count == 0 and right_count == 0:
+            score -= ISOLATED_PAWN_PENALTY
+
+        rank_index = chess.square_rank(square)
+        blocked = False
+        for enemy_square in enemy_pawns:
+            enemy_file = chess.square_file(enemy_square)
+            if abs(enemy_file - file_index) > 1:
+                continue
+            enemy_rank = chess.square_rank(enemy_square)
+            if color == chess.WHITE and enemy_rank > rank_index:
+                blocked = True
+                break
+            if color == chess.BLACK and enemy_rank < rank_index:
+                blocked = True
+                break
+        if not blocked:
+            advance = rank_index if color == chess.WHITE else 7 - rank_index
+            score += PASSED_PAWN_BONUS_BY_RANK[advance]
+
+    return score
+
+
+def _mobility_score(board: chess.Board, color: chess.Color) -> int:
+    score = 0
+    friendly_mask = board.occupied_co[color]
+    for piece_type, weight in PIECE_MOBILITY_WEIGHTS.items():
+        for square in board.pieces(piece_type, color):
+            attacks = board.attacks_mask(square) & ~friendly_mask
+            score += weight * chess.popcount(attacks)
+    return score
+
+
+def _center_score(board: chess.Board, color: chess.Color) -> int:
+    score = 0
+    for square in CENTER_SQUARES:
+        if _friendly(square, color, board):
+            score += CENTER_OCCUPANCY_BONUS
+
+    for square in EXTENDED_CENTER:
+        score += CENTER_ATTACK_BONUS * chess.popcount(board.attackers_mask(color, square))
+    return score
+
+
+def _rook_file_score(board: chess.Board, color: chess.Color) -> int:
+    score = 0
+    friendly_pawns = board.pieces(chess.PAWN, color)
+    enemy_pawns = board.pieces(chess.PAWN, not color)
+    for square in board.pieces(chess.ROOK, color):
+        file_index = chess.square_file(square)
+        friendly_on_file = any(chess.square_file(pawn_square) == file_index for pawn_square in friendly_pawns)
+        enemy_on_file = any(chess.square_file(pawn_square) == file_index for pawn_square in enemy_pawns)
+        if not friendly_on_file:
+            score += ROOK_SEMIOPEN_FILE_BONUS
+            if not enemy_on_file:
+                score += ROOK_OPEN_FILE_BONUS
+    return score
+
+
+def _king_safety_score(board: chess.Board, color: chess.Color, phase: int) -> int:
+    king_square = board.king(color)
+    if king_square is None:
+        return 0
+
+    score = 0
+    king_file = chess.square_file(king_square)
+    king_rank = chess.square_rank(king_square)
+
+    for file_index in range(max(0, king_file - 1), min(7, king_file + 1) + 1):
+        shelter_ranks = [king_rank + 1, king_rank + 2] if color == chess.WHITE else [king_rank - 1, king_rank - 2]
+        for rank_index in shelter_ranks:
+            if 0 <= rank_index < 8 and _friendly(chess.square(file_index, rank_index), color, board):
+                score += 4
+
+    enemy_pressure = 0
+    for square in chess.SquareSet(chess.BB_KING_ATTACKS[king_square]):
+        enemy_pressure += chess.popcount(board.attackers_mask(not color, square))
+    score -= enemy_pressure * (2 + phase // 8)
+
+    if board.has_castling_rights(color):
+        score += CASTLING_RIGHTS_BONUS * phase // 24
+    return score
+
+
+def _development_score(board: chess.Board, color: chess.Color, phase: int) -> int:
+    if phase <= 8:
+        return 0
+
+    home_rank = 0 if color == chess.WHITE else 7
+    penalty = 0
+    for piece_type in (chess.KNIGHT, chess.BISHOP):
+        for square in board.pieces(piece_type, color):
+            if chess.square_rank(square) == home_rank:
+                penalty += BACK_RANK_MINOR_PENALTY
+    return -penalty
+
+
+def _base_piece_safety_score(board: chess.Board, color: chess.Color) -> int:
+    score = 0
+    for piece_type in (chess.KNIGHT, chess.BISHOP, chess.ROOK, chess.QUEEN):
+        for square in board.pieces(piece_type, color):
+            attackers_mask = board.attackers_mask(not color, square)
+            if not attackers_mask:
+                continue
+
+            attackers = chess.popcount(attackers_mask)
+            defenders = chess.popcount(board.attackers_mask(color, square))
+            if defenders == 0:
+                score -= PIECE_VALUES[piece_type] // LOOSE_PIECE_DIVISOR
+            elif attackers > defenders:
+                score -= PIECE_VALUES[piece_type] // OUTNUMBERED_PIECE_DIVISOR
+
+            if defenders <= attackers:
+                pawn_pressure = 0
+                for attacker_square in chess.SquareSet(attackers_mask):
+                    if board.piece_type_at(attacker_square) == chess.PAWN:
+                        pawn_pressure += 1
+                score -= pawn_pressure * PAWN_HARASSMENT_PENALTY
+    return score
+
+
+def _base_piece_placement_score(board: chess.Board, color: chess.Color, phase: int) -> int:
+    score = 0
+    for square in board.pieces(chess.KNIGHT, color):
+        score += KNIGHT_CENTER_BONUS * _center_axis_score(square)
+    for square in board.pieces(chess.BISHOP, color):
+        score += BISHOP_CENTER_BONUS * _center_axis_score(square)
+
+    king_square = board.king(color)
+    if king_square is not None:
+        back_rank = 0 if color == chess.WHITE else 7
+        if chess.square_rank(king_square) == back_rank:
+            if king_square in (chess.C1, chess.G1, chess.C8, chess.G8):
+                rook_square = chess.square(3 if chess.square_file(king_square) == 2 else 5, back_rank)
+                if _friendly(rook_square, color, board):
+                    score += CASTLED_BONUS * phase // 24
+        score += KING_ENDGAME_CENTER_BONUS * _center_axis_score(king_square) * (24 - phase) // 24
+    return score
+
+
+def _base_threat_score(board: chess.Board, color: chess.Color) -> int:
+    score = 0
+    for piece_type in (chess.PAWN, chess.KNIGHT, chess.BISHOP, chess.ROOK, chess.QUEEN):
+        for square in board.pieces(piece_type, not color):
+            attackers = chess.popcount(board.attackers_mask(color, square))
+            if attackers == 0:
+                continue
+            defenders = chess.popcount(board.attackers_mask(not color, square))
+            if defenders == 0:
+                score += PIECE_VALUES[piece_type] // UNDEFENDED_TARGET_DIVISOR
+            elif attackers > defenders:
+                score += PIECE_VALUES[piece_type] // OVERLOADED_TARGET_DIVISOR
+    return score
+
+
+def _structure_hook(board: chess.Board, color: chess.Color, phase: int) -> int:
+    """Swarm lane: structure and castling heuristics."""
+    # SWARM_HOOK: structure
+    score = 0
+    pawns = board.pieces(chess.PAWN, color)
+    direction = 1 if color == chess.WHITE else -1
+
+    for square in pawns:
+        file_index = chess.square_file(square)
+        rank_index = chess.square_rank(square)
+
+        for neighbor_file in (file_index - 1, file_index + 1):
+            if 0 <= neighbor_file < 8:
+                neighbor_square = chess.square(neighbor_file, rank_index)
+                if neighbor_square in pawns:
+                    score += CONNECTED_PAWN_BONUS
+
+        support_rank = rank_index - direction
+        if 0 <= support_rank < 8:
+            for support_file in (file_index - 1, file_index + 1):
+                if 0 <= support_file < 8:
+                    support_square = chess.square(support_file, support_rank)
+                    if support_square in pawns:
+                        score += PAWN_CHAIN_BONUS
+
+        if file_index in (3, 4):
+            advance = rank_index if color == chess.WHITE else 7 - rank_index
+            if advance >= 3:
+                score += ADVANCED_CENTRAL_PAWN_BONUS
+
+    if chess.D4 in pawns and chess.E4 in pawns:
+        score += CENTRAL_PAWN_DUO_BONUS
+    if chess.D5 in pawns and chess.E5 in pawns:
+        score += CENTRAL_PAWN_DUO_BONUS
+
+    return score * phase // 24
+
+
+def _tactical_hook(board: chess.Board, color: chess.Color, phase: int) -> int:
+    """Swarm lane: tactical safety and loose-piece pressure."""
+    # SWARM_HOOK: tactical
+    score = _base_piece_safety_score(board, color)
+    tactical_values = {
+        chess.PAWN: 100,
+        chess.KNIGHT: 320,
+        chess.BISHOP: 335,
+        chess.ROOK: 500,
+        chess.QUEEN: 900,
+        chess.KING: 1200,
+    }
+
+    def _least_tactical_value(attackers_mask: int) -> int:
+        least = 10_000
+        for attacker_square in chess.SquareSet(attackers_mask):
+            piece_type = board.piece_type_at(attacker_square)
+            if piece_type is None:
+                continue
+            value = tactical_values[piece_type]
+            if value < least:
+                least = value
+        return least
+
+    def _exchange_edge(attacking_color: chess.Color, square: int, piece_type: int) -> int:
+        attackers_mask = board.attackers_mask(attacking_color, square)
+        if not attackers_mask:
+            return 0
+
+        least_attacker = _least_tactical_value(attackers_mask)
+        target_value = PIECE_VALUES[piece_type]
+        if least_attacker >= target_value:
+            return 0
+
+        defenders_mask = board.attackers_mask(not attacking_color, square)
+        edge = 0
+        if not defenders_mask:
+            edge += (target_value - least_attacker) // 16 + 4
+        else:
+            least_defender = _least_tactical_value(defenders_mask)
+            if least_defender > least_attacker:
+                edge += (least_defender - least_attacker) // 24 + 1
+            if chess.popcount(attackers_mask) > chess.popcount(defenders_mask):
+                edge += target_value // 96
+        return edge
+
+    for piece_type in (chess.KNIGHT, chess.BISHOP, chess.ROOK, chess.QUEEN):
+        for square in board.pieces(piece_type, not color):
+            score += _exchange_edge(color, square, piece_type)
+        for square in board.pieces(piece_type, color):
+            score -= _exchange_edge(not color, square, piece_type)
+
+    return score
+
+
+def _activity_hook(board: chess.Board, color: chess.Color, phase: int) -> int:
+    """Swarm lane: activity, centralization, and placement."""
+    # SWARM_HOOK: activity
+    return _base_piece_placement_score(board, color, phase)
+
+
+def _pawn_endgame_hook(board: chess.Board, color: chess.Color, phase: int) -> int:
+    """Swarm lane: pawn structure and endgame conversion."""
+    # SWARM_HOOK: pawn_endgame
+    return 0
+
+
+def _initiative_hook(board: chess.Board, color: chess.Color, phase: int) -> int:
+    """Swarm lane: threats, tempo conversion, and initiative."""
+    # SWARM_HOOK: initiative
+    return _base_threat_score(board, color)
+
+
-def evaluate(board: chess.Board) -> int:
+def _side_score(board: chess.Board, color: chess.Color, phase: int) -> int:
     score = 0
     for piece_type, piece_value in PIECE_VALUES.items():
-        score += piece_value * len(board.pieces(piece_type, chess.WHITE))
-        score -= piece_value * len(board.pieces(piece_type, chess.BLACK))
+        score += piece_value * len(board.pieces(piece_type, color))
+
+    if len(board.pieces(chess.BISHOP, color)) >= 2:
+        score += BISHOP_PAIR_BONUS
+
+    score += _pawn_structure_score(board, color)
+    score += _mobility_score(board, color)
+    score += _center_score(board, color)
+    score += _rook_file_score(board, color)
+    score += _king_safety_score(board, color, phase)
+    score += _development_score(board, color, phase)
+    score += _structure_hook(board, color, phase)
+    score += _tactical_hook(board, color, phase)
+    score += _activity_hook(board, color, phase)
+    score += _pawn_endgame_hook(board, color, phase)
+    score += _initiative_hook(board, color, phase)
+    return score
+
+
+def evaluate(board: chess.Board) -> int:
+    if board.is_checkmate():
+        return -100_000 if board.turn == chess.WHITE else 100_000
+    if board.is_stalemate() or board.is_insufficient_material():
+        return 0
 
-    white_center = sum(1 for square in (chess.D4, chess.E4, chess.D5, chess.E5) if board.color_at(square) == chess.WHITE)
-    black_center = sum(1 for square in (chess.D4, chess.E4, chess.D5, chess.E5) if board.color_at(square) == chess.BLACK)
-    score += 15 * (white_center - black_center)
+    phase = _phase(board)
+    score = _side_score(board, chess.WHITE, phase) - _side_score(board, chess.BLACK, phase)
+    score += TEMPO_BONUS if board.turn == chess.WHITE else -TEMPO_BONUS
     return score
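The `_phase` helper in the diff above drives the tapered evaluation: 4 points per queen on the board, 2 per rook, 1 per bishop or knight, capped at 24 so bonuses like `CASTLED_BONUS * phase // 24` scale smoothly from opening to endgame. A minimal standalone sketch with the piece counts passed as plain integers (so it runs without python-chess):

```python
def phase(queens: int, rooks: int, bishops: int, knights: int) -> int:
    """Game-phase count: 24 at full material, 0 in a bare-kings endgame."""
    return min(4 * queens + 2 * rooks + bishops + knights, 24)

# Full starting material across both sides: 2 queens, 4 rooks, 4 bishops, 4 knights.
full = phase(2, 4, 4, 4)  # 4*2 + 2*4 + 4 + 4 = 24
```

A term weighted by `phase // 24` therefore contributes fully at the start and fades to zero as pieces leave the board, while `(24 - phase) // 24` does the opposite.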
src/zero960_env/models.py CHANGED
@@ -21,3 +21,7 @@ class Zero960Observation(Observation):
     remaining_steps: int = 0
     last_match_score: float | None = None
     invalid_edit_count: int = 0
+    workflow_hint: str = ""
+    suggested_actions: list[str] = Field(default_factory=list)
+    has_valid_edit: bool = False
+    has_run_match: bool = False
src/zero960_env/server/environment.py CHANGED
@@ -35,6 +35,10 @@ class Zero960Environment(Environment[Zero960Action, Zero960Observation, State]):
             remaining_steps=observation.remaining_steps,
             last_match_score=observation.last_match_score,
             invalid_edit_count=observation.invalid_edit_count,
+            workflow_hint=observation.workflow_hint,
+            suggested_actions=observation.suggested_actions,
+            has_valid_edit=observation.has_valid_edit,
+            has_run_match=observation.has_run_match,
         )
 
     def step(
@@ -61,6 +65,10 @@ class Zero960Environment(Environment[Zero960Action, Zero960Observation, State]):
             remaining_steps=obs.remaining_steps,
             last_match_score=obs.last_match_score,
            invalid_edit_count=obs.invalid_edit_count,
+            workflow_hint=obs.workflow_hint,
+            suggested_actions=obs.suggested_actions,
+            has_valid_edit=obs.has_valid_edit,
+            has_run_match=obs.has_run_match,
             reward=obs.reward,
             done=obs.done,
         )
train/benchmark_engine.py ADDED
@@ -0,0 +1,207 @@
+"""Benchmark two full engine roots so each side uses its own search and eval code."""
+
+from __future__ import annotations
+
+import argparse
+import importlib.util
+from collections.abc import Callable
+from dataclasses import dataclass
+from pathlib import Path
+
+import chess
+
+from train.benchmark_eval import BenchmarkResult, _elo_from_score, _sample_positions
+
+EvalFn = Callable[[chess.Board], int]
+SelectMoveFn = Callable[[chess.Board, int, EvalFn], chess.Move]
+
+
+@dataclass(slots=True)
+class EngineHandle:
+    root: Path
+    eval_path: Path
+    search_path: Path
+    evaluate: EvalFn
+    select_move: SelectMoveFn
+
+
+def _load_module(path: Path, module_name: str) -> object:
+    spec = importlib.util.spec_from_file_location(module_name, path)
+    if spec is None or spec.loader is None:
+        raise RuntimeError(f"failed to load module from {path}")
+    module = importlib.util.module_from_spec(spec)
+    spec.loader.exec_module(module)
+    return module
+
+
+def _load_engine(root: Path, eval_rel: str, search_rel: str, label: str) -> EngineHandle:
+    eval_path = (root / eval_rel).resolve()
+    search_path = (root / search_rel).resolve()
+    eval_module = _load_module(eval_path, f"zero960_eval_{label}")
+    search_module = _load_module(search_path, f"zero960_search_{label}")
+
+    evaluate = getattr(eval_module, "evaluate", None)
+    select_move = getattr(search_module, "select_move", None)
+    if evaluate is None or not callable(evaluate):
+        raise RuntimeError(f"{eval_path} does not define evaluate(board)")
+    if select_move is None or not callable(select_move):
+        raise RuntimeError(f"{search_path} does not define select_move(board, depth, eval_fn)")
+
+    return EngineHandle(
+        root=root.resolve(),
+        eval_path=eval_path,
+        search_path=search_path,
+        evaluate=evaluate,
+        select_move=select_move,
+    )
+
+
+def _new_board(chess960_index: int) -> chess.Board:
+    board = chess.Board.from_chess960_pos(chess960_index)
+    board.chess960 = True
+    return board
+
+
+def _play_game(
+    chess960_index: int,
+    white_engine: EngineHandle,
+    black_engine: EngineHandle,
+    *,
+    depth: int,
+    max_plies: int,
+) -> float:
+    board = _new_board(chess960_index)
+
+    for _ in range(max_plies):
+        if board.is_game_over(claim_draw=True):
+            break
+        engine = white_engine if board.turn == chess.WHITE else black_engine
+        move = engine.select_move(board, depth=depth, eval_fn=engine.evaluate)
+        board.push(move)
+
+    result = board.result(claim_draw=True)
+    if result == "1-0":
+        return 1.0
+    if result == "0-1":
+        return 0.0
+    return 0.5
+
+
+def benchmark_engine_roots(
+    candidate_root: Path,
+    baseline_root: Path,
+    *,
+    candidate_eval_rel: str = "src/zero960/workspace_template/eval.py",
+    baseline_eval_rel: str = "src/zero960/workspace_template/eval.py",
+    candidate_search_rel: str = "src/zero960/engine/search.py",
+    baseline_search_rel: str = "src/zero960/engine/search.py",
+    positions: int = 64,
+    depth: int = 2,
+    max_plies: int = 120,
+    seed: int = 42,
+) -> BenchmarkResult:
+    candidate = _load_engine(candidate_root, candidate_eval_rel, candidate_search_rel, "candidate")
+    baseline = _load_engine(baseline_root, baseline_eval_rel, baseline_search_rel, "baseline")
+    start_positions = _sample_positions(positions, seed)
+
+    wins = 0
+    draws = 0
+    losses = 0
+    points = 0.0
+
+    for chess960_index in start_positions:
+        white_result = _play_game(
+            chess960_index,
+            candidate,
+            baseline,
+            depth=depth,
+            max_plies=max_plies,
+        )
+        points += white_result
+        if white_result == 1.0:
+            wins += 1
+        elif white_result == 0.5:
+            draws += 1
+        else:
+            losses += 1
+
+        black_result = 1.0 - _play_game(
+            chess960_index,
+            baseline,
+            candidate,
+            depth=depth,
+            max_plies=max_plies,
+        )
+        points += black_result
+        if black_result == 1.0:
+            wins += 1
+        elif black_result == 0.5:
+            draws += 1
+        else:
+            losses += 1
+
+    total_games = len(start_positions) * 2
+    score = points / total_games if total_games else 0.0
+    return BenchmarkResult(
+        candidate_path=candidate.root,
+        baseline_path=baseline.root,
+        positions=len(start_positions),
+        depth=depth,
+        max_plies=max_plies,
+        seed=seed,
+        wins=wins,
+        draws=draws,
+        losses=losses,
+        points=points,
+        total_games=total_games,
+        score=score,
+        elo_delta_estimate=_elo_from_score(score),
+    )
+
+
+def parse_args() -> argparse.Namespace:
+    root = Path(__file__).resolve().parents[1]
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--candidate-root", default=str(root))
+    parser.add_argument("--baseline-root", default=str(root))
+    parser.add_argument("--candidate-eval-rel", default="src/zero960/workspace_template/eval.py")
+    parser.add_argument("--baseline-eval-rel", default="src/zero960/workspace_template/eval.py")
+    parser.add_argument("--candidate-search-rel", default="src/zero960/engine/search.py")
+    parser.add_argument("--baseline-search-rel", default="src/zero960/engine/search.py")
+    parser.add_argument("--positions", type=int, default=64)
+    parser.add_argument("--depth", type=int, default=2)
+    parser.add_argument("--max-plies", type=int, default=120)
+    parser.add_argument("--seed", type=int, default=42)
+    return parser.parse_args()
+
+
+def main() -> None:
+    args = parse_args()
+    result = benchmark_engine_roots(
+        Path(args.candidate_root).resolve(),
+        Path(args.baseline_root).resolve(),
+        candidate_eval_rel=args.candidate_eval_rel,
+        baseline_eval_rel=args.baseline_eval_rel,
+        candidate_search_rel=args.candidate_search_rel,
+        baseline_search_rel=args.baseline_search_rel,
+        positions=args.positions,
+        depth=args.depth,
+        max_plies=args.max_plies,
+        seed=args.seed,
+    )
+
+    print(f"candidate_root: {result.candidate_path}")
+    print(f"baseline_root: {result.baseline_path}")
+    print(
+        f"positions={result.positions} depth={result.depth} max_plies={result.max_plies} "
+        f"games={result.total_games} seed={result.seed}"
+    )
+    print(
+        f"record={result.wins}-{result.draws}-{result.losses} "
+        f"points={result.points:.1f}/{result.total_games}"
+    )
+    print(f"score={result.score:.3f} elo_delta_estimate={result.elo_delta_estimate:.1f}")
+
+
+if __name__ == "__main__":
+    main()
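`_load_module` above uses the standard `importlib.util` pattern to execute an arbitrary `eval.py` or `search.py` from a file path without it being on `sys.path`. A self-contained sketch of that pattern, demonstrated against a throwaway module written to a temp directory (the `toy_eval` name and file are illustrative, not part of the repo):

```python
import importlib.util
import tempfile
from pathlib import Path


def load_module_from_path(path: Path, module_name: str):
    """Build a module spec from a file path and execute it, as _load_module does."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise RuntimeError(f"failed to load module from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Write a minimal eval-style module to disk, then load it dynamically.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "toy_eval.py"
    source.write_text("def evaluate(board):\n    return 0\n", encoding="utf-8")
    module = load_module_from_path(source, "toy_eval")
    result = module.evaluate(None)  # -> 0
```

Passing distinct `module_name` values (as `_load_engine` does with its `label` argument) keeps the candidate and baseline modules from colliding even when both files are named `eval.py`.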
train/benchmark_eval.py ADDED
@@ -0,0 +1,184 @@
+"""Benchmark two Chess960 eval functions against each other."""
+
+from __future__ import annotations
+
+import argparse
+import importlib.util
+import math
+import random
+from collections.abc import Callable
+from dataclasses import asdict, dataclass
+from pathlib import Path
+
+import chess
+
+from zero960.engine.match import play_game
+
+EvalFn = Callable[[chess.Board], int]
+
+
+@dataclass(slots=True)
+class BenchmarkResult:
+    candidate_path: Path
+    baseline_path: Path
+    positions: int
+    depth: int
+    max_plies: int
+    seed: int
+    wins: int
+    draws: int
+    losses: int
+    points: float
+    total_games: int
+    score: float
+    elo_delta_estimate: float
+
+    def to_json(self) -> dict[str, object]:
+        payload = asdict(self)
+        payload["candidate_path"] = str(self.candidate_path)
+        payload["baseline_path"] = str(self.baseline_path)
+        return payload
+
+
+def _load_eval(path: Path) -> EvalFn:
+    spec = importlib.util.spec_from_file_location(f"zero960_benchmark_{path.stem}", path)
+    if spec is None or spec.loader is None:
+        raise RuntimeError(f"failed to load module from {path}")
+
+    module = importlib.util.module_from_spec(spec)
+    spec.loader.exec_module(module)
+    evaluate = getattr(module, "evaluate", None)
+    if evaluate is None or not callable(evaluate):
+        raise RuntimeError(f"{path} does not define evaluate(board)")
+    return evaluate
+
+
+def _sample_positions(count: int, seed: int) -> list[int]:
+    rng = random.Random(seed)
+    population = list(range(960))
+    if count <= len(population):
+        return rng.sample(population, count)
+    return [rng.choice(population) for _ in range(count)]
+
+
+def _elo_from_score(score: float) -> float:
+    clipped = min(max(score, 0.01), 0.99)
+    return -400.0 * math.log10((1.0 / clipped) - 1.0)
+
+
+def benchmark_eval_files(
+    candidate_path: Path,
+    baseline_path: Path,
+    *,
+    positions: int = 64,
+    depth: int = 2,
+    max_plies: int = 120,
+    seed: int = 42,
+) -> BenchmarkResult:
+    candidate_eval = _load_eval(candidate_path)
+    baseline_eval = _load_eval(baseline_path)
+    start_positions = _sample_positions(positions, seed)
+
+    wins = 0
+    draws = 0
+    losses = 0
+    points = 0.0
+
+    for chess960_index in start_positions:
+        white_result = play_game(
+            chess960_index,
+            candidate_eval,
+            baseline_eval,
+            depth=depth,
+            max_plies=max_plies,
+        )
+        points += white_result
+        if white_result == 1.0:
+            wins += 1
+        elif white_result == 0.5:
+            draws += 1
+        else:
+            losses += 1
+
+        black_result = 1.0 - play_game(
+            chess960_index,
+            baseline_eval,
+            candidate_eval,
+            depth=depth,
+            max_plies=max_plies,
+        )
+        points += black_result
+        if black_result == 1.0:
+            wins += 1
+        elif black_result == 0.5:
+            draws += 1
+        else:
+            losses += 1
+
+    total_games = len(start_positions) * 2
+    score = points / total_games if total_games else 0.0
+    return BenchmarkResult(
+        candidate_path=candidate_path,
+        baseline_path=baseline_path,
+        positions=len(start_positions),
+        depth=depth,
+        max_plies=max_plies,
+        seed=seed,
+        wins=wins,
+        draws=draws,
+        losses=losses,
+        points=points,
+        total_games=total_games,
+        score=score,
+        elo_delta_estimate=_elo_from_score(score),
+    )
+
+
+def parse_args() -> argparse.Namespace:
+    root = Path(__file__).resolve().parents[1]
+    parser = argparse.ArgumentParser(description="Benchmark two Chess960 eval functions.")
+    parser.add_argument(
+        "--candidate-file",
+        default=str(root / "src/zero960/workspace_template/eval.py"),
+        help="Path to the candidate eval.py file.",
+    )
+    parser.add_argument(
+        "--baseline-file",
+        default=str(root / "src/zero960/engine/default_eval.py"),
+        help="Path to the baseline eval.py file.",
+    )
+    parser.add_argument("--positions", type=int, default=64)
+    parser.add_argument("--depth", type=int, default=2)
+    parser.add_argument("--max-plies", type=int, default=120)
+    parser.add_argument("--seed", type=int, default=42)
+    return parser.parse_args()
+
+
+def main() -> None:
+    args = parse_args()
+    candidate_path = Path(args.candidate_file).resolve()
+    baseline_path = Path(args.baseline_file).resolve()
+    result = benchmark_eval_files(
+        candidate_path,
+        baseline_path,
+        positions=args.positions,
+        depth=args.depth,
+        max_plies=args.max_plies,
+        seed=args.seed,
+    )
+
+    print(f"candidate: {candidate_path}")
+    print(f"baseline: {baseline_path}")
+    print(
+        f"positions={result.positions} depth={result.depth} max_plies={result.max_plies} "
+        f"games={result.total_games} seed={result.seed}"
+    )
+    print(
+        f"record={result.wins}-{result.draws}-{result.losses} "
+        f"points={result.points:.1f}/{result.total_games}"
+    )
+    print(f"score={result.score:.3f} elo_delta_estimate={result.elo_delta_estimate:.1f}")
+
+
+if __name__ == "__main__":
+    main()
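The `_elo_from_score` helper above inverts the logistic Elo expectation: a match score s (points per game) maps to a rating delta of -400 * log10(1/s - 1), with s clipped to [0.01, 0.99] so a perfect sweep stays finite. A minimal standalone sketch of the same formula:

```python
import math


def elo_from_score(score: float) -> float:
    """Estimated Elo delta for a given per-game score, clipped away from 0 and 1."""
    clipped = min(max(score, 0.01), 0.99)
    return -400.0 * math.log10((1.0 / clipped) - 1.0)


even = elo_from_score(0.5)     # 0.0: equal strength
strong = elo_from_score(0.75)  # ~+191: scoring 75% of the points
```

The mapping is antisymmetric about 0.5 (a 25% score estimates roughly -191), and the clipping means any score at or above 0.99 saturates at the same ceiling, so very lopsided small samples do not produce unbounded estimates.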
train/benchmark_league.py ADDED
@@ -0,0 +1,249 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ """Benchmark a candidate eval against a league of accepted swarm champions."""
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ from dataclasses import asdict, dataclass
+ from pathlib import Path
+
+ from train.benchmark_eval import BenchmarkResult, benchmark_eval_files
+
+
+ @dataclass(slots=True)
+ class LeagueOpponentResult:
+     opponent_path: Path
+     label: str
+     result: BenchmarkResult
+
+     def to_json(self) -> dict[str, object]:
+         payload = asdict(self)
+         payload["opponent_path"] = str(self.opponent_path)
+         payload["result"] = self.result.to_json()
+         return payload
+
+
+ @dataclass(slots=True)
+ class LeagueResult:
+     candidate_path: Path
+     opponents: list[LeagueOpponentResult]
+     total_points: float
+     total_games: int
+     overall_score: float
+     overall_elo_delta_estimate: float
+
+     def to_json(self) -> dict[str, object]:
+         return {
+             "candidate_path": str(self.candidate_path),
+             "opponents": [opponent.to_json() for opponent in self.opponents],
+             "total_points": self.total_points,
+             "total_games": self.total_games,
+             "overall_score": self.overall_score,
+             "overall_elo_delta_estimate": self.overall_elo_delta_estimate,
+         }
+
+
+ def _repo_root() -> Path:
+     return Path(__file__).resolve().parents[1]
+
+
+ def _default_candidate(root: Path) -> Path:
+     return root / "outputs" / "codex_swarm" / "champion_eval.py"
+
+
+ def _default_baseline(root: Path) -> Path:
+     return root / "src" / "zero960" / "engine" / "default_eval.py"
+
+
+ def _accepted_snapshots(root: Path) -> list[Path]:
+     accepted_dir = root / "outputs" / "codex_swarm" / "accepted"
+     if not accepted_dir.exists():
+         return []
+     return sorted(accepted_dir.glob("*_eval.py"))
+
+
+ def _dedupe_paths(paths: list[Path]) -> list[Path]:
+     seen: set[Path] = set()
+     ordered: list[Path] = []
+     for path in paths:
+         resolved = path.resolve()
+         if resolved in seen or not resolved.exists():
+             continue
+         seen.add(resolved)
+         ordered.append(resolved)
+     return ordered
+
+
+ def _same_contents(left: Path, right: Path) -> bool:
+     return left.read_text(encoding="utf-8") == right.read_text(encoding="utf-8")
+
+
+ def _label_for_path(root: Path, path: Path) -> str:
+     resolved = path.resolve()
+     champion = (root / "outputs" / "codex_swarm" / "champion_eval.py").resolve()
+     baseline = (root / "src" / "zero960" / "engine" / "default_eval.py").resolve()
+     if resolved == champion:
+         return "current_champion"
+     if resolved == baseline:
+         return "original_baseline"
+     return resolved.stem
+
+
+ def default_league_opponents(
+     *,
+     candidate_path: Path,
+     include_baseline: bool,
+     include_champion: bool,
+     accepted_limit: int | None,
+ ) -> list[Path]:
+     root = _repo_root()
+     opponents: list[Path] = []
+     if include_baseline:
+         opponents.append(_default_baseline(root))
+     if include_champion:
+         opponents.append(_default_candidate(root))
+
+     accepted = _accepted_snapshots(root)
+     if accepted_limit is not None:
+         accepted = accepted[-accepted_limit:]
+     opponents.extend(accepted)
+
+     filtered: list[Path] = []
+     for path in _dedupe_paths(opponents):
+         if path.resolve() == candidate_path.resolve():
+             continue
+         if _same_contents(path, candidate_path):
+             continue
+         filtered.append(path)
+     return filtered
+
+
+ def benchmark_league(
+     candidate_path: Path,
+     opponent_paths: list[Path],
+     *,
+     positions: int,
+     depth: int,
+     max_plies: int,
+     seed: int,
+ ) -> LeagueResult:
+     root = _repo_root()
+     opponent_results: list[LeagueOpponentResult] = []
+     total_points = 0.0
+     total_games = 0
+
+     for offset, opponent_path in enumerate(opponent_paths):
+         result = benchmark_eval_files(
+             candidate_path,
+             opponent_path,
+             positions=positions,
+             depth=depth,
+             max_plies=max_plies,
+             seed=seed + offset,
+         )
+         opponent_results.append(
+             LeagueOpponentResult(
+                 opponent_path=opponent_path,
+                 label=_label_for_path(root, opponent_path),
+                 result=result,
+             )
+         )
+         total_points += result.points
+         total_games += result.total_games
+
+     overall_score = total_points / total_games if total_games else 0.0
+     overall_elo = 0.0
+     if total_games:
+         from train.benchmark_eval import _elo_from_score  # local reuse
+
+         overall_elo = _elo_from_score(overall_score)
+
+     return LeagueResult(
+         candidate_path=candidate_path,
+         opponents=opponent_results,
+         total_points=total_points,
+         total_games=total_games,
+         overall_score=overall_score,
+         overall_elo_delta_estimate=overall_elo,
+     )
+
+
+ def parse_args() -> argparse.Namespace:
+     root = _repo_root()
+     parser = argparse.ArgumentParser(description=__doc__)
+     parser.add_argument(
+         "--candidate-file",
+         default=str(_default_candidate(root)),
+         help="Path to the candidate eval.py file.",
+     )
+     parser.add_argument(
+         "--opponent-file",
+         action="append",
+         default=[],
+         help="Optional explicit opponent file. Repeat to add more than one.",
+     )
+     parser.add_argument("--positions", type=int, default=16)
+     parser.add_argument("--depth", type=int, default=2)
+     parser.add_argument("--max-plies", type=int, default=120)
+     parser.add_argument("--seed", type=int, default=42)
+     parser.add_argument(
+         "--accepted-limit",
+         type=int,
+         default=4,
+         help="How many accepted swarm snapshots to include by default.",
+     )
+     parser.add_argument("--no-baseline", action="store_true", help="Exclude the original baseline from the league.")
+     parser.add_argument("--no-champion", action="store_true", help="Exclude the current champion from the league.")
+     parser.add_argument("--json", action="store_true", help="Print the full result as JSON.")
+     return parser.parse_args()
+
+
+ def main() -> None:
+     args = parse_args()
+     candidate_path = Path(args.candidate_file).resolve()
+     explicit_opponents = [Path(path).resolve() for path in args.opponent_file]
+
+     opponents = _dedupe_paths(explicit_opponents)
+     if not opponents:
+         opponents = default_league_opponents(
+             candidate_path=candidate_path,
+             include_baseline=not args.no_baseline,
+             include_champion=not args.no_champion,
+             accepted_limit=args.accepted_limit,
+         )
+
+     if not opponents:
+         raise SystemExit("No league opponents found.")
+
+     result = benchmark_league(
+         candidate_path,
+         opponents,
+         positions=args.positions,
+         depth=args.depth,
+         max_plies=args.max_plies,
+         seed=args.seed,
+     )
+
+     if args.json:
+         print(json.dumps(result.to_json(), indent=2, sort_keys=True))
+         return
+
+     print(f"candidate: {candidate_path}")
+     print(f"league opponents: {len(result.opponents)}")
+     for opponent in result.opponents:
+         benchmark = opponent.result
+         print(
+             f"- {opponent.label}: record={benchmark.wins}-{benchmark.draws}-{benchmark.losses} "
+             f"points={benchmark.points:.1f}/{benchmark.total_games} score={benchmark.score:.3f} "
+             f"elo_delta_estimate={benchmark.elo_delta_estimate:.1f}"
+         )
+     print(
+         f"overall: points={result.total_points:.1f}/{result.total_games} "
+         f"score={result.overall_score:.3f} elo_delta_estimate={result.overall_elo_delta_estimate:.1f}"
+     )
+
+
+ if __name__ == "__main__":
+     main()
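The league runner above pools points and games across all opponents before converting the pooled score into a single Elo estimate. A standalone sketch of that pooling plus the logistic score-to-Elo mapping (mirroring `_elo_from_score` from `train/benchmark_eval.py`; the per-opponent numbers here are made up for illustration):

```python
import math

def elo_from_score(score: float) -> float:
    # Invert the logistic expected-score model; clamp to avoid infinities at 0 or 1.
    clipped = min(max(score, 0.01), 0.99)
    return -400.0 * math.log10((1.0 / clipped) - 1.0)

# Pool per-opponent (points, games) pairs the way benchmark_league does.
per_opponent = [(6.5, 10), (4.0, 10), (7.0, 10)]  # hypothetical match results
total_points = sum(points for points, _ in per_opponent)
total_games = sum(games for _, games in per_opponent)
overall_score = total_points / total_games if total_games else 0.0

print(f"{overall_score:.3f}")                    # 0.583
print(f"{elo_from_score(overall_score):.1f}")    # 58.5
```

Pooling before the Elo conversion weights each opponent by games played, rather than averaging per-opponent Elo deltas, which would overweight short matches.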
train/benchmark_uci.py ADDED
@@ -0,0 +1,307 @@
+ """Benchmark a local Chess960 eval file against a UCI engine such as Stockfish."""
+
+ from __future__ import annotations
+
+ import argparse
+ import importlib.util
+ import math
+ import random
+ from collections.abc import Callable
+ from dataclasses import asdict, dataclass
+ from pathlib import Path
+
+ import chess
+ import chess.engine
+
+ from zero960.engine.search import select_move
+
+ EvalFn = Callable[[chess.Board], int]
+
+
+ @dataclass(slots=True)
+ class UciBenchmarkResult:
+     candidate_path: Path
+     engine_command: str
+     engine_options: dict[str, bool | int | float | str]
+     positions: int
+     max_plies: int
+     seed: int
+     candidate_depth: int | None
+     candidate_nodes: int | None
+     engine_depth: int | None
+     engine_nodes: int | None
+     wins: int
+     draws: int
+     losses: int
+     points: float
+     total_games: int
+     score: float
+     elo_delta_estimate: float
+
+     def to_json(self) -> dict[str, object]:
+         payload = asdict(self)
+         payload["candidate_path"] = str(self.candidate_path)
+         return payload
+
+
+ def _load_eval(path: Path) -> EvalFn:
+     spec = importlib.util.spec_from_file_location(f"zero960_uci_benchmark_{path.stem}", path)
+     if spec is None or spec.loader is None:
+         raise RuntimeError(f"failed to load module from {path}")
+
+     module = importlib.util.module_from_spec(spec)
+     spec.loader.exec_module(module)
+     evaluate = getattr(module, "evaluate", None)
+     if evaluate is None or not callable(evaluate):
+         raise RuntimeError(f"{path} does not define evaluate(board)")
+     return evaluate
+
+
+ def _sample_positions(count: int, seed: int) -> list[int]:
+     rng = random.Random(seed)
+     population = list(range(960))
+     if count <= len(population):
+         return rng.sample(population, count)
+     return [rng.choice(population) for _ in range(count)]
+
+
+ def _elo_from_score(score: float) -> float:
+     clipped = min(max(score, 0.01), 0.99)
+     return -400.0 * math.log10((1.0 / clipped) - 1.0)
+
+
+ def _new_board(chess960_index: int) -> chess.Board:
+     board = chess.Board.from_chess960_pos(chess960_index)
+     board.chess960 = True
+     return board
+
+
+ def _engine_limit(depth: int | None, nodes: int | None) -> chess.engine.Limit:
+     if depth is not None:
+         return chess.engine.Limit(depth=depth)
+     if nodes is not None:
+         return chess.engine.Limit(nodes=nodes)
+     raise ValueError("expected depth or nodes limit")
+
+
+ def _parse_option_value(raw_value: str) -> bool | int | float | str:
+     lowered = raw_value.lower()
+     if lowered in {"true", "false"}:
+         return lowered == "true"
+     try:
+         return int(raw_value)
+     except ValueError:
+         pass
+     try:
+         return float(raw_value)
+     except ValueError:
+         pass
+     return raw_value
+
+
+ def _parse_engine_options(pairs: list[str]) -> dict[str, bool | int | float | str]:
+     options: dict[str, bool | int | float | str] = {}
+     for pair in pairs:
+         if "=" not in pair:
+             raise ValueError(f"invalid --engine-option {pair!r}; expected NAME=VALUE")
+         name, raw_value = pair.split("=", 1)
+         option_name = name.strip()
+         if not option_name:
+             raise ValueError(f"invalid --engine-option {pair!r}; missing option name")
+         options[option_name] = _parse_option_value(raw_value.strip())
+     return options
+
+
+ def _play_game_vs_engine(
+     chess960_index: int,
+     candidate_eval: EvalFn,
+     engine: chess.engine.SimpleEngine,
+     *,
+     candidate_is_white: bool,
+     candidate_depth: int | None,
+     candidate_nodes: int | None,
+     engine_depth: int | None,
+     engine_nodes: int | None,
+     max_plies: int,
+ ) -> float:
+     board = _new_board(chess960_index)
+     candidate_limit = _engine_limit(candidate_depth, candidate_nodes)
+     opponent_limit = _engine_limit(engine_depth, engine_nodes)
+
+     for _ in range(max_plies):
+         if board.is_game_over(claim_draw=True):
+             break
+
+         candidate_turn = board.turn == chess.WHITE if candidate_is_white else board.turn == chess.BLACK
+         if candidate_turn:
+             if candidate_limit.depth is not None:
+                 move = select_move(board, depth=candidate_limit.depth, eval_fn=candidate_eval)
+             else:
+                 raise ValueError("candidate_nodes is not supported by the local engine path")
+         else:
+             result = engine.play(board, opponent_limit)
+             move = result.move
+             if move is None:
+                 raise RuntimeError("UCI engine returned no move")
+
+         board.push(move)
+
+     result = board.result(claim_draw=True)
+     if result == "1-0":
+         return 1.0 if candidate_is_white else 0.0
+     if result == "0-1":
+         return 0.0 if candidate_is_white else 1.0
+     return 0.5
+
+
+ def benchmark_eval_vs_uci(
+     candidate_path: Path,
+     engine_command: str,
+     *,
+     engine_options: dict[str, bool | int | float | str] | None = None,
+     positions: int = 32,
+     candidate_depth: int = 2,
+     candidate_nodes: int | None = None,
+     engine_depth: int = 1,
+     engine_nodes: int | None = None,
+     max_plies: int = 120,
+     seed: int = 42,
+ ) -> UciBenchmarkResult:
+     candidate_eval = _load_eval(candidate_path)
+     start_positions = _sample_positions(positions, seed)
+     configured_engine_options = dict(engine_options or {})
+
+     wins = 0
+     draws = 0
+     losses = 0
+     points = 0.0
+
+     with chess.engine.SimpleEngine.popen_uci(engine_command) as engine:
+         if configured_engine_options:
+             engine.configure(configured_engine_options)
+         for chess960_index in start_positions:
+             white_result = _play_game_vs_engine(
+                 chess960_index,
+                 candidate_eval,
+                 engine,
+                 candidate_is_white=True,
+                 candidate_depth=candidate_depth,
+                 candidate_nodes=candidate_nodes,
+                 engine_depth=engine_depth,
+                 engine_nodes=engine_nodes,
+                 max_plies=max_plies,
+             )
+             points += white_result
+             if white_result == 1.0:
+                 wins += 1
+             elif white_result == 0.5:
+                 draws += 1
+             else:
+                 losses += 1
+
+             black_result = _play_game_vs_engine(
+                 chess960_index,
+                 candidate_eval,
+                 engine,
+                 candidate_is_white=False,
+                 candidate_depth=candidate_depth,
+                 candidate_nodes=candidate_nodes,
+                 engine_depth=engine_depth,
+                 engine_nodes=engine_nodes,
+                 max_plies=max_plies,
+             )
+             points += black_result
+             if black_result == 1.0:
+                 wins += 1
+             elif black_result == 0.5:
+                 draws += 1
+             else:
+                 losses += 1
+
+     total_games = len(start_positions) * 2
+     score = points / total_games if total_games else 0.0
+     return UciBenchmarkResult(
+         candidate_path=candidate_path,
+         engine_command=engine_command,
+         engine_options=configured_engine_options,
+         positions=len(start_positions),
+         max_plies=max_plies,
+         seed=seed,
+         candidate_depth=candidate_depth,
+         candidate_nodes=candidate_nodes,
+         engine_depth=engine_depth,
+         engine_nodes=engine_nodes,
+         wins=wins,
+         draws=draws,
+         losses=losses,
+         points=points,
+         total_games=total_games,
+         score=score,
+         elo_delta_estimate=_elo_from_score(score),
+     )
+
+
+ def parse_args() -> argparse.Namespace:
+     root = Path(__file__).resolve().parents[1]
+     parser = argparse.ArgumentParser(description="Benchmark a local eval file against a UCI engine.")
+     parser.add_argument(
+         "--candidate-file",
+         default=str(root / "src/zero960/workspace_template/eval.py"),
+         help="Path to the candidate eval.py file.",
+     )
+     parser.add_argument(
+         "--engine-command",
+         default="stockfish",
+         help="UCI engine command, for example 'stockfish'.",
+     )
+     parser.add_argument(
+         "--engine-option",
+         action="append",
+         default=[],
+         help="Repeated engine option in NAME=VALUE form, for example UCI_LimitStrength=true.",
+     )
+     parser.add_argument("--positions", type=int, default=32)
+     parser.add_argument("--candidate-depth", type=int, default=2)
+     parser.add_argument("--candidate-nodes", type=int, default=None)
+     parser.add_argument("--engine-depth", type=int, default=1)
+     parser.add_argument("--engine-nodes", type=int, default=None)
+     parser.add_argument("--max-plies", type=int, default=120)
+     parser.add_argument("--seed", type=int, default=42)
+     return parser.parse_args()
+
+
+ def main() -> None:
+     args = parse_args()
+     candidate_path = Path(args.candidate_file).resolve()
+     engine_options = _parse_engine_options(args.engine_option)
+     result = benchmark_eval_vs_uci(
+         candidate_path,
+         args.engine_command,
+         engine_options=engine_options,
+         positions=args.positions,
+         candidate_depth=args.candidate_depth,
+         candidate_nodes=args.candidate_nodes,
+         engine_depth=args.engine_depth,
+         engine_nodes=args.engine_nodes,
+         max_plies=args.max_plies,
+         seed=args.seed,
+     )
+
+     print(f"candidate: {result.candidate_path}")
+     print(f"engine: {result.engine_command}")
+     if result.engine_options:
+         print(f"engine_options={result.engine_options}")
+     print(
+         f"positions={result.positions} max_plies={result.max_plies} games={result.total_games} seed={result.seed} "
+         f"candidate_depth={result.candidate_depth} engine_depth={result.engine_depth} "
+         f"candidate_nodes={result.candidate_nodes} engine_nodes={result.engine_nodes}"
+     )
+     print(
+         f"record={result.wins}-{result.draws}-{result.losses} "
+         f"points={result.points:.1f}/{result.total_games}"
+     )
+     print(f"score={result.score:.3f} elo_delta_estimate={result.elo_delta_estimate:.1f}")
+
+
+ if __name__ == "__main__":
+     main()
train/build_dashboard.py ADDED
@@ -0,0 +1,656 @@
+ """Build a self-contained HTML dashboard for swarm and benchmark results."""
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import tempfile
+ from dataclasses import asdict, dataclass
+ from datetime import datetime
+ from pathlib import Path
+
+ from train.benchmark_engine import benchmark_engine_roots
+ from train.benchmark_league import benchmark_league, default_league_opponents
+ from train.benchmark_uci import benchmark_eval_vs_uci
+
+
+ @dataclass(slots=True)
+ class DashboardData:
+     generated_at: str
+     current_champion: str
+     accepted_count: int
+     all_results: list[dict[str, object]]
+     accepted_results: list[dict[str, object]]
+     engine_progress: dict[str, object] | None
+     league: dict[str, object] | None
+     stockfish_anchors: list[dict[str, object]]
+
+     def to_json(self) -> dict[str, object]:
+         return asdict(self)
+
+
+ def _repo_root() -> Path:
+     return Path(__file__).resolve().parents[1]
+
+
+ def _load_jsonl(path: Path) -> list[dict[str, object]]:
+     if not path.exists():
+         return []
+     rows: list[dict[str, object]] = []
+     for line in path.read_text(encoding="utf-8").splitlines():
+         line = line.strip()
+         if not line:
+             continue
+         rows.append(json.loads(line))
+     return rows
+
+
+ def _short_summary(summary: str, limit: int = 180) -> str:
+     compact = " ".join(summary.split())
+     if len(compact) <= limit:
+         return compact
+     return compact[: limit - 3] + "..."
+
+
+ def _normalize_result(entry: dict[str, object]) -> dict[str, object]:
+     benchmark = entry.get("benchmark") or {}
+     round_dir = str(entry.get("round_dir", ""))
+     round_name = Path(round_dir).name if round_dir else "unknown"
+     return {
+         "worker_name": entry.get("worker_name"),
+         "accepted": bool(entry.get("accepted")),
+         "winner": bool(entry.get("winner")),
+         "round_name": round_name,
+         "score": benchmark.get("score"),
+         "elo_delta_estimate": benchmark.get("elo_delta_estimate"),
+         "wins": benchmark.get("wins"),
+         "draws": benchmark.get("draws"),
+         "losses": benchmark.get("losses"),
+         "points": benchmark.get("points"),
+         "total_games": benchmark.get("total_games"),
+         "candidate_file": entry.get("candidate_file"),
+         "summary": _short_summary(str(entry.get("summary", ""))),
+         "surface": entry.get("surface", "eval"),
+     }
+
+
+ def _copy_file(src: Path, dst: Path) -> None:
+     dst.parent.mkdir(parents=True, exist_ok=True)
+     dst.write_text(src.read_text(encoding="utf-8"), encoding="utf-8")
+
+
+ def _build_engine_progress(
+     root: Path,
+     champion_eval_path: Path,
+     *,
+     baseline_root: Path,
+     positions: int,
+     depth: int,
+     max_plies: int,
+     seed: int,
+ ) -> dict[str, object] | None:
+     if not baseline_root.exists():
+         return None
+     baseline_eval = baseline_root / "src" / "zero960" / "workspace_template" / "eval.py"
+     baseline_search = baseline_root / "src" / "zero960" / "engine" / "search.py"
+     if not baseline_eval.exists() or not baseline_search.exists():
+         return None
+
+     with tempfile.TemporaryDirectory(prefix="0x960-dashboard-engine-") as temp_dir:
+         candidate_root = Path(temp_dir)
+         _copy_file(
+             champion_eval_path,
+             candidate_root / "src" / "zero960" / "workspace_template" / "eval.py",
+         )
+         _copy_file(
+             root / "src" / "zero960" / "engine" / "search.py",
+             candidate_root / "src" / "zero960" / "engine" / "search.py",
+         )
+         result = benchmark_engine_roots(
+             candidate_root,
+             baseline_root,
+             positions=positions,
+             depth=depth,
+             max_plies=max_plies,
+             seed=seed,
+         )
+         return {
+             "label": "Current engine vs search baseline",
+             "candidate_eval_path": str(champion_eval_path),
+             "candidate_search_path": str((root / "src" / "zero960" / "engine" / "search.py").resolve()),
+             "baseline_root": str(baseline_root),
+             "result": result.to_json(),
+         }
+
+
+ def _build_stockfish_anchors(
+     candidate_path: Path,
+     *,
+     positions: int,
+     candidate_depth: int,
+     engine_depth: int,
+     max_plies: int,
+     seed: int,
+     engine_command: str,
+     anchor_elos: list[int],
+ ) -> list[dict[str, object]]:
+     rows: list[dict[str, object]] = []
+     for elo in anchor_elos:
+         result = benchmark_eval_vs_uci(
+             candidate_path,
+             engine_command,
+             engine_options={"UCI_LimitStrength": True, "UCI_Elo": elo},
+             positions=positions,
+             candidate_depth=candidate_depth,
+             engine_depth=engine_depth,
+             max_plies=max_plies,
+             seed=seed,
+         )
+         rows.append(
+             {
+                 "label": f"Stockfish {elo}",
+                 "uci_elo": elo,
+                 "score": result.score,
+                 "elo_delta_estimate": result.elo_delta_estimate,
+                 "wins": result.wins,
+                 "draws": result.draws,
+                 "losses": result.losses,
+                 "points": result.points,
+                 "total_games": result.total_games,
+             }
+         )
+     return rows
+
+
+ def _build_dashboard_data(args: argparse.Namespace) -> DashboardData:
+     root = _repo_root()
+     ledger_path = root / "outputs" / "codex_swarm" / "ledger.jsonl"
+     champion_path = Path(args.candidate_file).resolve()
+     ledger_rows = _load_jsonl(ledger_path)
+     normalized_rows = [_normalize_result(row) for row in ledger_rows if row.get("benchmark") is not None]
+     accepted_rows = [row for row in normalized_rows if row["accepted"]]
+
+     league_payload: dict[str, object] | None = None
+     opponents = default_league_opponents(
+         candidate_path=champion_path,
+         include_baseline=True,
+         include_champion=True,
+         accepted_limit=args.league_accepted_limit,
+     )
+     if opponents:
+         league_result = benchmark_league(
+             champion_path,
+             opponents,
+             positions=args.league_positions,
+             depth=args.depth,
+             max_plies=args.max_plies,
+             seed=args.seed,
+         )
+         league_payload = league_result.to_json()
+
+     stockfish_rows: list[dict[str, object]] = []
+     engine_progress: dict[str, object] | None = None
+     if args.include_engine_progress:
+         engine_progress = _build_engine_progress(
+             root,
+             champion_path,
+             baseline_root=Path(args.engine_baseline_root).resolve(),
+             positions=args.engine_positions,
+             depth=args.depth,
+             max_plies=args.max_plies,
+             seed=args.seed,
+         )
+     if args.include_stockfish:
+         stockfish_rows = _build_stockfish_anchors(
+             champion_path,
+             positions=args.stockfish_positions,
+             candidate_depth=args.depth,
+             engine_depth=args.stockfish_depth,
+             max_plies=args.max_plies,
+             seed=args.seed,
+             engine_command=args.engine_command,
+             anchor_elos=args.stockfish_elo,
+         )
+
+     return DashboardData(
+         generated_at=datetime.now().isoformat(timespec="seconds"),
+         current_champion=str(champion_path),
+         accepted_count=len(accepted_rows),
+         all_results=normalized_rows,
+         accepted_results=accepted_rows,
+         engine_progress=engine_progress,
+         league=league_payload,
+         stockfish_anchors=stockfish_rows,
+     )
+
+
+ def _dashboard_html(payload: dict[str, object]) -> str:
+     data_json = json.dumps(payload)
+     template = """<!doctype html>
+ <html lang="en">
+ <head>
+ <meta charset="utf-8">
+ <meta name="viewport" content="width=device-width, initial-scale=1">
+ <title>0x960 Dashboard</title>
+ <style>
+ :root {{
+   --bg: #0d1117;
+   --panel: #151b23;
+   --panel-2: #1d2733;
+   --text: #e6edf3;
+   --muted: #9fb0c0;
+   --green: #3fb950;
+   --red: #f85149;
+   --amber: #d29922;
+   --blue: #58a6ff;
+   --border: #2d3a49;
+ }}
+ * {{ box-sizing: border-box; }}
+ body {{
+   margin: 0;
+   font-family: ui-sans-serif, -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+   background:
+     radial-gradient(circle at top left, rgba(88,166,255,0.14), transparent 28%),
+     radial-gradient(circle at top right, rgba(63,185,80,0.12), transparent 22%),
+     linear-gradient(180deg, #0b1016 0%, var(--bg) 100%);
+   color: var(--text);
+ }}
+ .wrap {{
+   width: min(1200px, calc(100vw - 32px));
+   margin: 0 auto;
+   padding: 28px 0 48px;
+ }}
+ h1, h2, h3, p {{ margin: 0; }}
+ .hero {{
+   display: grid;
+   gap: 12px;
+   margin-bottom: 20px;
+ }}
+ .hero p {{ color: var(--muted); }}
+ .grid {{
+   display: grid;
+   grid-template-columns: repeat(12, 1fr);
+   gap: 16px;
+ }}
+ .card {{
+   background: linear-gradient(180deg, rgba(255,255,255,0.02), rgba(255,255,255,0.01));
+   border: 1px solid var(--border);
+   border-radius: 18px;
+   padding: 18px;
+   backdrop-filter: blur(8px);
+   box-shadow: 0 18px 50px rgba(0,0,0,0.22);
+ }}
+ .span-3 {{ grid-column: span 3; }}
+ .span-4 {{ grid-column: span 4; }}
+ .span-5 {{ grid-column: span 5; }}
+ .span-6 {{ grid-column: span 6; }}
+ .span-7 {{ grid-column: span 7; }}
+ .span-8 {{ grid-column: span 8; }}
+ .span-12 {{ grid-column: span 12; }}
+ .kpi-label {{ color: var(--muted); font-size: 13px; margin-bottom: 8px; }}
+ .kpi-value {{ font-size: 34px; font-weight: 700; letter-spacing: -0.03em; }}
+ .kpi-sub {{ color: var(--muted); margin-top: 6px; font-size: 13px; }}
+ .section-title {{ font-size: 18px; margin-bottom: 14px; }}
+ .chart {{ width: 100%; height: 280px; }}
+ .bars .row, .table-row {{
+   display: grid;
+   gap: 12px;
+   align-items: center;
+ }}
+ .bars .row {{
+   grid-template-columns: 190px 1fr 72px;
+   margin-bottom: 10px;
+ }}
+ .bar-track {{
+   height: 12px;
+   border-radius: 999px;
+   background: rgba(255,255,255,0.08);
+   overflow: hidden;
+ }}
+ .bar-fill {{
+   height: 100%;
+   border-radius: 999px;
+   background: linear-gradient(90deg, var(--blue), #8ed0ff);
+ }}
+ .good {{ color: var(--green); }}
+ .bad {{ color: var(--red); }}
+ .muted {{ color: var(--muted); }}
+ .table-head, .table-row {{
+   grid-template-columns: 126px 70px 82px 120px 1fr;
+   font-size: 13px;
+   padding: 10px 0;
+   border-bottom: 1px solid rgba(255,255,255,0.06);
+ }}
+ .table-head {{
+   color: var(--muted);
+   text-transform: uppercase;
+   letter-spacing: 0.06em;
+   font-size: 11px;
+ }}
+ .pill {{
+   display: inline-block;
+   border-radius: 999px;
+   padding: 4px 10px;
+   font-size: 11px;
+   font-weight: 700;
+   letter-spacing: 0.04em;
+   text-transform: uppercase;
+   background: rgba(255,255,255,0.08);
+ }}
+ .pill.win {{ background: rgba(63,185,80,0.16); color: var(--green); }}
+ .pill.loss {{ background: rgba(248,81,73,0.14); color: var(--red); }}
+ .pill.flat {{ background: rgba(210,153,34,0.14); color: var(--amber); }}
+ .league-list {{
+   display: grid;
+   gap: 12px;
+ }}
+ .league-item {{
+   display: grid;
+   grid-template-columns: 1fr auto auto;
+   gap: 12px;
+   padding: 12px 0;
+   border-bottom: 1px solid rgba(255,255,255,0.06);
+ }}
+ .footer {{
+   margin-top: 16px;
+   font-size: 12px;
+   color: var(--muted);
+ }}
+ @media (max-width: 900px) {{
+   .span-3, .span-4, .span-5, .span-6, .span-7, .span-8, .span-12 {{
+     grid-column: span 12;
+   }}
+   .bars .row {{ grid-template-columns: 1fr; }}
+   .table-head, .table-row {{ grid-template-columns: 1fr; gap: 6px; }}
+   .league-item {{ grid-template-columns: 1fr; }}
+ }}
+ </style>
+ </head>
+ <body>
+ <div class="wrap">
+   <div class="hero">
+     <h1>0x960 Engine Dashboard</h1>
+     <p>Swarm progress, internal Elo deltas, league self-play, and optional Stockfish anchors in one static page.</p>
+   </div>
+   <div class="grid" id="app"></div>
+   <div class="footer" id="footer"></div>
+ </div>
+ <script type="application/json" id="dashboard-data">__DASHBOARD_JSON__</script>
+ <script>
+ const data = JSON.parse(document.getElementById('dashboard-data').textContent);
+ const app = document.getElementById('app');
+ const footer = document.getElementById('footer');
+
+ const accepted = data.accepted_results || [];
+ const league = data.league;
+ const anchors = data.stockfish_anchors || [];
+ const engineProgress = data.engine_progress;
+ const all = data.all_results || [];
+
+ const latestAccepted = accepted.length ? accepted[accepted.length - 1] : null;
+ const bestAccepted = accepted.length
+   ? accepted.reduce((best, row) => (row.score > best.score ? row : best), accepted[0])
+   : null;
+ const bestRejected = all.filter((row) => !row.accepted && row.score !== null).reduce((best, row) => {{
+   if (!best || row.score > best.score) return row;
+   return best;
+ }}, null);
+
+ function card(cls, inner) {{
+   const el = document.createElement('section');
+   el.className = `card ${{cls}}`;
+   el.innerHTML = inner;
+   return el;
+ }}
+
+ function scoreClass(value) {{
+   if (value > 0.5) return 'good';
+   if (value < 0.5) return 'bad';
+   return 'muted';
+ }}
+
+ function eloClass(value) {{
+   if (value > 0) return 'good';
+   if (value < 0) return 'bad';
+   return 'muted';
+ }}
+
+ const kpis = [
+   {{
+     label: 'Accepted Champions',
+     value: String(data.accepted_count),
+     sub: latestAccepted ? `Latest: ${{latestAccepted.worker_name}}` : 'No accepted challenger yet'
+   }},
+   {{
+     label: 'Current Internal Score',
+     value: latestAccepted ? latestAccepted.score.toFixed(3) : 'n/a',
+     sub: latestAccepted ? 'vs previous champion' : 'Awaiting accepted run'
+   }},
+   {{
+     label: 'Current Internal Elo',
+     value: latestAccepted ? `${{latestAccepted.elo_delta_estimate.toFixed(1)}}` : 'n/a',
+     sub: latestAccepted ? 'delta vs prior champion' : 'Awaiting accepted run'
+   }},
+   {{
+     label: 'League Score',
+     value: league ? league.overall_score.toFixed(3) : 'n/a',
+     sub: league ? `${{league.total_points.toFixed(1)}}/${{league.total_games}} points` : 'League not available'
+   }}
+ ];
+
+ if (engineProgress) {{
+   kpis.push({{
+     label: 'Search Gain',
+     value: `${{engineProgress.result.elo_delta_estimate.toFixed(1)}}`,
+     sub: `${{engineProgress.result.points.toFixed(1)}}/${{engineProgress.result.total_games}} vs baseline`
+   }});
+ }}
+
+ for (const kpi of kpis) {{
+   app.appendChild(card('span-3', `
+     <div class="kpi-label">${{kpi.label}}</div>
+     <div class="kpi-value">${{kpi.value}}</div>
+     <div class="kpi-sub">${{kpi.sub}}</div>
+   `));
+ }}
+
+ function lineChart(rows) {{
+   if (!rows.length) {{
+     return '<p class="muted">No accepted results yet.</p>';
+   }}
+   const width = 640;
+   const height = 260;
+   const padding = 28;
+ const padding = 28;
464
+ const xs = rows.map((_, index) => padding + (index * (width - padding * 2) / Math.max(rows.length - 1, 1)));
465
+ const ys = rows.map((row) => {{
466
+ const score = row.score ?? 0.5;
467
+ return height - padding - ((score - 0.35) / 0.35) * (height - padding * 2);
468
+ }});
469
+ const points = xs.map((x, index) => `${{x}},${{ys[index]}}`).join(' ');
470
+ const circles = xs.map((x, index) =>
471
+ `<circle cx="${{x}}" cy="${{ys[index]}}" r="5" fill="#58a6ff"><title>${{rows[index].worker_name}}: ${{rows[index].score.toFixed(3)}}</title></circle>`
472
+ ).join('');
473
+ return `
474
+ <svg viewBox="0 0 ${{width}} ${{height}}" class="chart" role="img" aria-label="Accepted score progression">
475
+ <line x1="${{padding}}" y1="${{height - padding}}" x2="${{width - padding}}" y2="${{height - padding}}" stroke="rgba(255,255,255,0.18)" />
476
+ <line x1="${{padding}}" y1="${{padding}}" x2="${{padding}}" y2="${{height - padding}}" stroke="rgba(255,255,255,0.18)" />
477
+ <line x1="${{padding}}" y1="${{height - padding - ((0.5 - 0.35) / 0.35) * (height - padding * 2)}}" x2="${{width - padding}}" y2="${{height - padding - ((0.5 - 0.35) / 0.35) * (height - padding * 2)}}" stroke="rgba(210,153,34,0.35)" stroke-dasharray="4 4" />
478
+ <polyline fill="none" stroke="#58a6ff" stroke-width="3" points="${{points}}" />
479
+ ${{circles}}
480
+ </svg>
481
+ `;
482
+ }}
483
+
484
+ app.appendChild(card('span-7', `
485
+ <h2 class="section-title">Accepted Score Progression</h2>
486
+ ${{lineChart(accepted)}}
487
+ `));
488
+
489
+ const summaryRows = [
490
+ latestAccepted ? `<div class="league-item"><div><strong>Latest winner</strong><div class="muted">${{latestAccepted.worker_name}} in ${{latestAccepted.round_name}}</div></div><div class="${{eloClass(latestAccepted.elo_delta_estimate)}}">${{latestAccepted.elo_delta_estimate.toFixed(1)}} Elo</div><div class="${{scoreClass(latestAccepted.score)}}">${{latestAccepted.score.toFixed(3)}} score</div></div>` : '',
491
+ bestAccepted ? `<div class="league-item"><div><strong>Best accepted score</strong><div class="muted">${{bestAccepted.worker_name}}</div></div><div class="${{eloClass(bestAccepted.elo_delta_estimate)}}">${{bestAccepted.elo_delta_estimate.toFixed(1)}} Elo</div><div class="${{scoreClass(bestAccepted.score)}}">${{bestAccepted.score.toFixed(3)}} score</div></div>` : '',
492
+ bestRejected ? `<div class="league-item"><div><strong>Best rejected try</strong><div class="muted">${{bestRejected.worker_name}} in ${{bestRejected.round_name}}</div></div><div class="${{eloClass(bestRejected.elo_delta_estimate)}}">${{bestRejected.elo_delta_estimate.toFixed(1)}} Elo</div><div class="${{scoreClass(bestRejected.score)}}">${{bestRejected.score.toFixed(3)}} score</div></div>` : ''
493
+ ].join('');
494
+
495
+ app.appendChild(card('span-5', `
496
+ <h2 class="section-title">Swarm Snapshot</h2>
497
+ <div class="league-list">${{summaryRows || '<p class="muted">No benchmark rows yet.</p>'}}</div>
498
+ `));
499
+
500
+ if (engineProgress) {{
501
+ app.appendChild(card('span-12', `
502
+ <h2 class="section-title">Engine Search Progress</h2>
503
+ <div class="league-list">
504
+ <div class="league-item">
505
+ <div>
506
+ <strong>${{engineProgress.label}}</strong>
507
+ <div class="muted">${{engineProgress.result.wins}}-${{engineProgress.result.draws}}-${{engineProgress.result.losses}}</div>
508
+ </div>
509
+ <div class="${{scoreClass(engineProgress.result.score)}}">${{engineProgress.result.score.toFixed(3)}} score</div>
510
+ <div class="${{eloClass(engineProgress.result.elo_delta_estimate)}}">${{engineProgress.result.elo_delta_estimate.toFixed(1)}} Elo</div>
511
+ </div>
512
+ </div>
513
+ <div class="kpi-sub" style="margin-top: 10px;">
514
+ Candidate search: ${{engineProgress.candidate_search_path}}<br>
515
+ Baseline root: ${{engineProgress.baseline_root}}
516
+ </div>
517
+ `));
518
+ }}
519
+
520
+ function barRows(rows, key, formatter) {{
521
+ if (!rows.length) {{
522
+ return '<p class="muted">No data yet.</p>';
523
+ }}
524
+ const values = rows.map((row) => Math.abs(row[key] ?? 0));
525
+ const max = Math.max(...values, 1);
526
+ return rows.map((row) => {{
527
+ const value = row[key] ?? 0;
528
+ const width = Math.max(6, Math.round(Math.abs(value) / max * 100));
529
+ const cls = value > 0 ? 'good' : value < 0 ? 'bad' : 'muted';
530
+ const fill = value > 0 ? 'var(--green)' : value < 0 ? 'var(--red)' : 'var(--amber)';
531
+ return `
532
+ <div class="row">
533
+ <div>${{row.worker_name}}</div>
534
+ <div class="bar-track"><div class="bar-fill" style="width:${{width}}%; background:${{fill}}"></div></div>
535
+ <div class="${{cls}}">${{formatter(value)}}</div>
536
+ </div>
537
+ `;
538
+ }}).join('');
539
+ }}
540
+
541
+ app.appendChild(card('span-6', `
542
+ <h2 class="section-title">Accepted Internal Elo Deltas</h2>
543
+ <div class="bars">${{barRows(accepted, 'elo_delta_estimate', (value) => value.toFixed(1))}}</div>
544
+ `));
545
+
546
+ app.appendChild(card('span-6', `
547
+ <h2 class="section-title">League Self-Play</h2>
548
+ ${{
549
+ league
550
+ ? `<div class="league-list">${{league.opponents.map((opponent) => `
551
+ <div class="league-item">
552
+ <div>
553
+ <strong>${{opponent.label}}</strong>
554
+ <div class="muted">${{opponent.result.wins}}-${{opponent.result.draws}}-${{opponent.result.losses}}</div>
555
+ </div>
556
+ <div class="${{scoreClass(opponent.result.score)}}">${{opponent.result.score.toFixed(3)}} score</div>
557
+ <div class="${{eloClass(opponent.result.elo_delta_estimate)}}">${{opponent.result.elo_delta_estimate.toFixed(1)}} Elo</div>
558
+ </div>
559
+ `).join('')}}</div>
560
+ <div class="kpi-sub" style="margin-top: 10px;">Overall: ${{league.overall_score.toFixed(3)}} score, ${{league.overall_elo_delta_estimate.toFixed(1)}} Elo delta estimate</div>`
561
+ : '<p class="muted">League benchmark not available.</p>'
562
+ }}
563
+ `));
564
+
565
+ if (anchors.length) {{
566
+ app.appendChild(card('span-12', `
567
+ <h2 class="section-title">Stockfish Anchor Ladder</h2>
568
+ <div class="bars">${{anchors.map((row) => `
569
+ <div class="row">
570
+ <div>${{row.label}}</div>
571
+ <div class="bar-track"><div class="bar-fill" style="width:${{Math.max(6, Math.round(row.score * 100))}}%; background:${{row.score >= 0.5 ? 'var(--green)' : 'var(--blue)'}}"></div></div>
572
+ <div class="${{scoreClass(row.score)}}">${{row.score.toFixed(3)}}</div>
573
+ </div>
574
+ `).join('')}}</div>
575
+ `));
576
+ }}
577
+
578
+ const rows = all.slice().reverse().map((row) => {{
579
+ const pillClass = row.accepted ? 'win' : (row.score > 0.5 ? 'flat' : 'loss');
580
+ const pillText = row.accepted ? 'accepted' : 'rejected';
581
+ return `
582
+ <div class="table-row">
583
+ <div>${{row.round_name}}</div>
584
+ <div>${{row.worker_name}}</div>
585
+ <div><span class="pill ${{pillClass}}">${{pillText}}</span></div>
586
+ <div class="${{scoreClass(row.score)}}">${{row.score !== null ? row.score.toFixed(3) : 'n/a'}}</div>
587
+ <div class="muted">${{row.summary}}</div>
588
+ </div>
589
+ `;
590
+ }}).join('');
591
+
592
+ app.appendChild(card('span-12', `
593
+ <h2 class="section-title">Recent Swarm Results</h2>
594
+ <div class="table-head">
595
+ <div>Round</div>
596
+ <div>Worker</div>
597
+ <div>Status</div>
598
+ <div>Score</div>
599
+ <div>Summary</div>
600
+ </div>
601
+ ${{rows || '<p class="muted">No swarm results yet.</p>'}}
602
+ `));
603
+
604
+ footer.textContent = `Generated ${{data.generated_at}} | champion: ${{data.current_champion}}`;
605
+ </script>
606
+ </body>
607
+ </html>
608
+ """
609
+ template = template.replace("{{", "{").replace("}}", "}")
610
+ return template.replace("__DASHBOARD_JSON__", data_json)
611
+
612
+
613
+ def parse_args() -> argparse.Namespace:
614
+ root = _repo_root()
615
+ parser = argparse.ArgumentParser(description=__doc__)
616
+ parser.add_argument(
617
+ "--candidate-file",
618
+ default=str(root / "outputs" / "codex_swarm" / "champion_eval.py"),
619
+ help="Candidate file to treat as the current engine in the dashboard.",
620
+ )
621
+ parser.add_argument(
622
+ "--output-dir",
623
+ default=str(root / "outputs" / "dashboard"),
624
+ help="Directory where index.html and dashboard_data.json will be written.",
625
+ )
626
+ parser.add_argument("--depth", type=int, default=2)
627
+ parser.add_argument("--max-plies", type=int, default=120)
628
+ parser.add_argument("--seed", type=int, default=42)
629
+ parser.add_argument("--league-positions", type=int, default=8)
630
+ parser.add_argument("--league-accepted-limit", type=int, default=4)
631
+ parser.add_argument("--include-engine-progress", action="store_true")
632
+ parser.add_argument("--engine-baseline-root", default="/tmp/0x960-search-baseline")
633
+ parser.add_argument("--engine-positions", type=int, default=8)
634
+ parser.add_argument("--include-stockfish", action="store_true")
635
+ parser.add_argument("--engine-command", default="stockfish")
636
+ parser.add_argument("--stockfish-depth", type=int, default=1)
637
+ parser.add_argument("--stockfish-positions", type=int, default=4)
638
+ parser.add_argument("--stockfish-elo", type=int, action="append", default=[1320, 1600])
639
+ return parser.parse_args()
640
+
641
+
642
+ def main() -> None:
643
+ args = parse_args()
644
+ output_dir = Path(args.output_dir).resolve()
645
+ output_dir.mkdir(parents=True, exist_ok=True)
646
+
647
+ payload = _build_dashboard_data(args).to_json()
648
+ (output_dir / "dashboard_data.json").write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n", encoding="utf-8")
649
+ (output_dir / "index.html").write_text(_dashboard_html(payload), encoding="utf-8")
650
+
651
+ print(f"wrote {(output_dir / 'index.html')}")
652
+ print(f"wrote {(output_dir / 'dashboard_data.json')}")
653
+
654
+
655
+ if __name__ == "__main__":
656
+ main()
train/codex_distill.py ADDED
@@ -0,0 +1,332 @@
+ """Collect teacher trajectories from Codex for 0x960 and emit SFT-ready samples.
+
+ This script keeps the teacher inside the same bounded action space as the student:
+ the model sees the current observation and returns exactly one JSON action per turn.
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import shutil
+ import subprocess
+ import tempfile
+ import time
+ from dataclasses import dataclass
+ from pathlib import Path
+
+ from zero960_env.client import Zero960Client
+ from zero960_env.models import Zero960Action, Zero960Observation
+
+ from train.minimal_trl_openenv import SYSTEM_PROMPT, format_observation_as_prompt
+
+ ACTION_SCHEMA = {
+ "type": "object",
+ "additionalProperties": False,
+ "properties": {
+ "action_type": {
+ "type": "string",
+ "enum": ["read_file", "write_file", "run_static_eval", "run_match", "finish"],
+ },
+ "path": {"type": ["string", "null"]},
+ "content": {"type": ["string", "null"]},
+ },
+ "required": ["action_type", "path", "content"],
+ }
+
+ TEACHER_INSTRUCTIONS = """You are the teacher policy for 0x960.
+
+ Return exactly one JSON action object that matches the provided schema.
+
+ Constraints:
+ - Act only through the bounded action schema. Do not describe actions.
+ - Do not use shell commands or external tools.
+ - The current eval.py contents are already included in the observation.
+ - Prefer the high-reward loop: write_file -> run_match -> finish.
+ - Avoid repeated run_static_eval unless it is truly necessary.
+ - Always include all three JSON keys: action_type, path, content.
+ - Use null for unused fields. Example: {"action_type":"run_match","path":null,"content":null}
+ - If you choose write_file, return a full valid replacement for eval.py.
+ """
+
+
+ @dataclass(slots=True)
+ class TeacherTurn:
+ action: Zero960Action
+ raw_response: str
+ elapsed_s: float
+
+
+ def _action_payload(action: Zero960Action) -> dict:
+ return {
+ "action_type": action.action_type,
+ "path": action.path,
+ "content": action.content,
+ }
+
+
+ def _find_codex_binary(explicit_path: str | None) -> str:
+ if explicit_path:
+ return explicit_path
+ codex_bin = shutil.which("codex")
+ if codex_bin is None:
+ raise RuntimeError("codex CLI not found on PATH; install or pass --codex-bin")
+ return codex_bin
+
+
+ def _teacher_prompt(observation: Zero960Observation) -> str:
+ return (
+ f"{TEACHER_INSTRUCTIONS}\n"
+ "Use the same environment contract as the student prompt below.\n\n"
+ f"System prompt:\n{SYSTEM_PROMPT}\n\n"
+ f"Observation:\n{format_observation_as_prompt(observation)}\n"
+ )
+
+
+ def _run_codex_turn(
+ codex_bin: str,
+ model: str,
+ workdir: Path,
+ observation: Zero960Observation,
+ timeout_s: int,
+ ) -> TeacherTurn:
+ prompt = _teacher_prompt(observation)
+
+ with tempfile.TemporaryDirectory(prefix="zero960_codex_") as temp_dir_str:
+ temp_dir = Path(temp_dir_str)
+ schema_path = temp_dir / "action.schema.json"
+ output_path = temp_dir / "action.json"
+ schema_path.write_text(json.dumps(ACTION_SCHEMA))
+
+ command = [
+ codex_bin,
+ "exec",
+ "--model",
+ model,
+ "--cd",
+ str(workdir),
+ "--ephemeral",
+ "--color",
+ "never",
+ "--output-schema",
+ str(schema_path),
+ "--output-last-message",
+ str(output_path),
+ "-",
+ ]
+
+ started = time.time()
+ result = subprocess.run(
+ command,
+ input=prompt,
+ text=True,
+ capture_output=True,
+ timeout=timeout_s,
+ check=False,
+ )
+ elapsed_s = round(time.time() - started, 2)
+
+ if result.returncode != 0:
+ stderr = result.stderr.strip()
+ if "refresh_token_reused" in stderr:
+ raise RuntimeError(
+ "codex auth is stale; run `codex logout` then `codex login` and retry"
+ )
+ if "usage limit" in stderr.lower():
+ raise RuntimeError("codex usage limit reached; stop the batch and retry later")
+ raise RuntimeError(f"codex exec failed with exit code {result.returncode}: {stderr}")
+ if not output_path.exists():
+ raise RuntimeError("codex exec did not write an output message")
+
+ raw_response = output_path.read_text().strip()
+ if not raw_response:
+ raise RuntimeError("codex exec returned an empty final message")
+ action = Zero960Action.model_validate_json(raw_response)
+ return TeacherTurn(action=action, raw_response=raw_response, elapsed_s=elapsed_s)
+
+
+ def _append_jsonl(path: Path, payload: dict) -> None:
+ with path.open("a") as handle:
+ handle.write(json.dumps(payload, default=str) + "\n")
+
+
+ def _sft_sample(observation: Zero960Observation, action: Zero960Action, metadata: dict) -> dict:
+ return {
+ "messages": [
+ {"role": "system", "content": SYSTEM_PROMPT},
+ {"role": "user", "content": format_observation_as_prompt(observation)},
+ {"role": "assistant", "content": json.dumps(_action_payload(action))},
+ ],
+ "metadata": metadata,
+ }
+
+
+ def collect_teacher_rollouts(
+ base_url: str,
+ model: str,
+ episodes: int,
+ max_turns: int,
+ timeout_s: int,
+ output_dir: Path,
+ min_reward: float,
+ codex_bin: str | None,
+ ) -> tuple[Path, Path]:
+ output_dir.mkdir(parents=True, exist_ok=True)
+ trace_path = output_dir / f"teacher_rollouts_{int(time.time())}.jsonl"
+ sft_path = output_dir / f"sft_samples_{int(time.time())}.jsonl"
+ trace_path.touch()
+ sft_path.touch()
+
+ codex_executable = _find_codex_binary(codex_bin)
+ workdir = Path(__file__).resolve().parents[1]
+
+ with Zero960Client(base_url=base_url) as client:
+ stop_reason: str | None = None
+ for episode_index in range(episodes):
+ reset_result = client.reset()
+ observation = reset_result.observation
+ episode_turns: list[dict] = []
+ forced_finish = False
+
+ for turn_index in range(max_turns):
+ if reset_result.done:
+ break
+
+ pre_action_observation = observation
+ try:
+ teacher_turn = _run_codex_turn(
+ codex_bin=codex_executable,
+ model=model,
+ workdir=workdir,
+ observation=pre_action_observation,
+ timeout_s=timeout_s,
+ )
+ except RuntimeError as exc:
+ if "usage limit reached" in str(exc):
+ stop_reason = str(exc)
+ break
+ raise
+
+ step_result = client.step(teacher_turn.action)
+ observation = step_result.observation
+
+ turn_payload = {
+ "episode_index": episode_index,
+ "turn_index": turn_index,
+ "teacher_model": model,
+ "elapsed_s": teacher_turn.elapsed_s,
+ "raw_response": teacher_turn.raw_response,
+ "action": _action_payload(teacher_turn.action),
+ "observation_before": pre_action_observation.model_dump(),
+ "observation_after": observation.model_dump(),
+ "reward": step_result.reward,
+ "done": step_result.done,
+ }
+ episode_turns.append(turn_payload)
+
+ if step_result.done:
+ reset_result = step_result
+ break
+ reset_result = step_result
+
+ if stop_reason is not None:
+ break
+
+ if not reset_result.done:
+ forced_finish = True
+ finish_result = client.step(Zero960Action(action_type="finish"))
+ observation = finish_result.observation
+ episode_turns.append(
+ {
+ "episode_index": episode_index,
+ "turn_index": len(episode_turns),
+ "teacher_model": model,
+ "elapsed_s": 0.0,
+ "raw_response": json.dumps({"action_type": "finish"}),
+ "action": {"action_type": "finish"},
+ "observation_before": reset_result.observation.model_dump(),
+ "observation_after": observation.model_dump(),
+ "reward": finish_result.reward,
+ "done": finish_result.done,
+ "forced_finish": True,
+ }
+ )
+ reset_result = finish_result
+
+ final_reward = float(reset_result.reward or 0.0)
+ accepted = (
+ final_reward >= min_reward
+ and observation.has_valid_edit
+ and observation.has_run_match
+ )
+
+ episode_payload = {
+ "episode_index": episode_index,
+ "teacher_model": model,
+ "forced_finish": forced_finish,
+ "accepted_for_sft": accepted,
+ "final_reward": final_reward,
+ "final_status": observation.status_message,
+ "turns": episode_turns,
+ }
+ _append_jsonl(trace_path, episode_payload)
+
+ if accepted:
+ for turn in episode_turns:
+ if turn.get("forced_finish"):
+ continue
+ sample = _sft_sample(
+ observation=Zero960Observation.model_validate(turn["observation_before"]),
+ action=Zero960Action.model_validate(turn["action"]),
+ metadata={
+ "episode_index": episode_index,
+ "turn_index": turn["turn_index"],
+ "teacher_model": model,
+ "final_reward": final_reward,
+ },
+ )
+ _append_jsonl(sft_path, sample)
+
+ print(
+ {
+ "episode": episode_index,
+ "final_reward": final_reward,
+ "accepted_for_sft": accepted,
+ "turns": len(episode_turns),
+ "final_status": observation.status_message,
+ }
+ )
+
+ if stop_reason is not None:
+ print({"stopped_early": True, "reason": stop_reason})
+
+ return trace_path, sft_path
+
+
+ def main() -> None:
+ parser = argparse.ArgumentParser(description="Collect Codex teacher rollouts for 0x960.")
+ parser.add_argument("--base-url", default="http://127.0.0.1:8000")
+ parser.add_argument("--model", default="gpt-5.4")
+ parser.add_argument("--episodes", type=int, default=20)
+ parser.add_argument("--max-turns", type=int, default=6)
+ parser.add_argument("--timeout-s", type=int, default=180)
+ parser.add_argument("--min-reward", type=float, default=0.4)
+ parser.add_argument("--codex-bin", default=None)
+ parser.add_argument("--output-dir", default="outputs/codex_distill")
+ args = parser.parse_args()
+
+ trace_path, sft_path = collect_teacher_rollouts(
+ base_url=args.base_url,
+ model=args.model,
+ episodes=args.episodes,
+ max_turns=args.max_turns,
+ timeout_s=args.timeout_s,
+ output_dir=Path(args.output_dir),
+ min_reward=args.min_reward,
+ codex_bin=args.codex_bin,
+ )
+ print({"trace_path": str(trace_path), "sft_path": str(sft_path)})
+
+
+ if __name__ == "__main__":
+ main()
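The `ACTION_SCHEMA` and `TEACHER_INSTRUCTIONS` above define a strict three-key action contract. As a minimal sketch (not part of the commit; the helper name `is_valid_action` is hypothetical, and the real pipeline validates with `Zero960Action.model_validate_json` instead), the contract can be checked by hand with only the stdlib:

```python
import json

# Mirrors ACTION_SCHEMA from codex_distill.py: every action object must carry
# exactly these three keys, with null (None) for the unused ones.
REQUIRED_KEYS = {"action_type", "path", "content"}
ALLOWED_TYPES = {"read_file", "write_file", "run_static_eval", "run_match", "finish"}


def is_valid_action(raw: str) -> bool:
    """Return True when `raw` is a JSON object matching the teacher contract."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict) or set(payload) != REQUIRED_KEYS:
        return False
    return payload["action_type"] in ALLOWED_TYPES


# The example action given verbatim in TEACHER_INSTRUCTIONS passes:
print(is_valid_action('{"action_type":"run_match","path":null,"content":null}'))  # True
# An object missing the path/content keys is rejected:
print(is_valid_action('{"action_type":"run_match"}'))  # False
```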
train/codex_swarm.py ADDED
@@ -0,0 +1,1114 @@
+ """Local Codex swarm coordinator for champion/challenger engine iteration."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import argparse
6
+ import difflib
7
+ import json
8
+ import shutil
9
+ import subprocess
10
+ import time
11
+ from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, TimeoutError, as_completed
12
+ from dataclasses import dataclass
13
+ from datetime import UTC, datetime
14
+ from pathlib import Path
15
+
16
+ from train.benchmark_engine import benchmark_engine_roots
17
+ from train.benchmark_eval import BenchmarkResult, benchmark_eval_files
18
+
19
+ DEFAULT_MODEL = "gpt-5.3-codex"
20
+ DEFAULT_WORKER_COUNT = 5
21
+ DEFAULT_SCREEN_POSITIONS = 8
22
+ DEFAULT_POSITIONS = 32
23
+ DEFAULT_DEPTH = 2
24
+ DEFAULT_MAX_PLIES = 120
25
+ DEFAULT_SEARCH_SCREEN_POSITIONS = 1
26
+ DEFAULT_SEARCH_SCREEN_DEPTH = 1
27
+ DEFAULT_SEARCH_SCREEN_MAX_PLIES = 20
28
+ DEFAULT_FINAL_BENCHMARK_TIMEOUT_SEC = 180
29
+ DEFAULT_SCREEN_MIN_SCORE = 0.52
30
+ DEFAULT_MIN_SCORE = 0.53
31
+ DEFAULT_MAX_DIFF_LINES = 80
32
+ DEFAULT_EDITABLE_FILES = ("src/zero960/workspace_template/eval.py",)
33
+ DEFAULT_WORKER_TIMEOUT_SEC = 600
34
+ DEFAULT_BENCHMARK_TIMEOUT_SEC = 180
35
+ DEFAULT_REFERENCE_PATHS = (
36
+ "AGENTS.md",
37
+ "README.md",
38
+ "docs/codex-swarm-plan.md",
39
+ "docs/process.md",
40
+ "train/benchmark_eval.py",
41
+ "train/benchmark_engine.py",
42
+ "train/benchmark_league.py",
43
+ "train/benchmark_uci.py",
44
+ "src/zero960/engine/default_eval.py",
45
+ "src/zero960/engine/search.py",
46
+ )
47
+ DEFAULT_SYNC_PATHS = (
48
+ *DEFAULT_REFERENCE_PATHS,
49
+ "src/zero960/workspace_template/eval.py",
50
+ )
51
+ DEFAULT_SURFACE = "eval"
52
+ DEFAULT_EVAL_EDITABLE_FILES = ("src/zero960/workspace_template/eval.py",)
53
+ DEFAULT_SEARCH_EDITABLE_FILES = ("src/zero960/engine/search.py",)
54
+ DEFAULT_WORKER_SPECIALIZATIONS = (
55
+ (
56
+ "Structure Researcher",
57
+ "Study Chess960-specific king safety, castling structure, and pawn cover weaknesses around both kings.",
58
+ "_structure_hook",
59
+ ),
60
+ (
61
+ "Pawn-Endgame Researcher",
62
+ "Study pawn structure, passed-pawn pressure, rook-file coordination, and simple endgame conversion bonuses.",
63
+ "_pawn_endgame_hook",
64
+ ),
65
+ (
66
+ "Initiative Tuner",
67
+ "Study tempo, mobility pressure, queen safety, and initiative terms that might convert shallow-search advantages faster.",
68
+ "_initiative_hook",
69
+ ),
70
+ (
71
+ "Activity Researcher",
72
+ "Study piece activity, development, space, and centralization terms that help when search depth is limited.",
73
+ "_activity_hook",
74
+ ),
75
+ (
76
+ "Tactical Safety Researcher",
77
+ "Study loose-piece pressure, attacked-undefended pieces, and tactical safety terms that matter at shallow search depth.",
78
+ "_tactical_hook",
79
+ ),
80
+ )
81
+ DEFAULT_SEARCH_SPECIALIZATIONS = (
82
+ (
83
+ "Move Ordering Researcher",
84
+ "Study move ordering, capture ordering, and cheap heuristics that push strong moves to the front early.",
85
+ "_move_order_score",
86
+ ),
87
+ (
88
+ "Quiescence Researcher",
89
+ "Study tactical horizon control and leaf evaluation extension without exploding the tree.",
90
+ "_quiescence",
91
+ ),
92
+ (
93
+ "Tree Search Researcher",
94
+ "Study alpha-beta search control, pruning safety, and transposition-table usage in the main negamax loop.",
95
+ "negamax",
96
+ ),
97
+ (
98
+ "Root Policy Researcher",
99
+ "Study root move selection, aspiration behavior, and tie-breaking that helps shallow search convert edges.",
100
+ "select_move",
101
+ ),
102
+ (
103
+ "Tactical Move Filter Researcher",
104
+ "Study which tactical moves should survive into the quiescence frontier without causing pointless explosion.",
105
+ "_tactical_moves",
106
+ ),
107
+ )
108
+
109
+
110
+ @dataclass(slots=True)
111
+ class SwarmPaths:
112
+ repo_root: Path
113
+ state_root: Path
114
+ worktree_root: Path
115
+ champion_eval: Path
116
+ champion_search: Path
117
+ ledger_path: Path
118
+
119
+
120
+ @dataclass(slots=True)
121
+ class WorkerResult:
122
+ worker_name: str
123
+ worktree_dir: Path
124
+ round_dir: Path
125
+ prompt_path: Path
126
+ final_message_path: Path
127
+ stdout_path: Path
128
+ stderr_path: Path
129
+ candidate_file: Path
130
+ changed_files: list[str]
131
+ diff_lines_added: int
132
+ diff_lines_deleted: int
133
+ screen_benchmark: BenchmarkResult | None
134
+ benchmark: BenchmarkResult | None
135
+ exit_code: int | None
136
+ accepted: bool
137
+ summary: str
138
+ sandbox_mode: str
139
+
140
+ def to_json(self) -> dict[str, object]:
141
+ return {
142
+ "worker_name": self.worker_name,
143
+ "worktree_dir": str(self.worktree_dir),
144
+ "round_dir": str(self.round_dir),
145
+ "prompt_path": str(self.prompt_path),
146
+ "final_message_path": str(self.final_message_path),
147
+ "stdout_path": str(self.stdout_path),
148
+ "stderr_path": str(self.stderr_path),
149
+ "candidate_file": str(self.candidate_file),
150
+ "changed_files": self.changed_files,
151
+ "diff_lines_added": self.diff_lines_added,
152
+ "diff_lines_deleted": self.diff_lines_deleted,
153
+ "screen_benchmark": None if self.screen_benchmark is None else self.screen_benchmark.to_json(),
154
+ "benchmark": None if self.benchmark is None else self.benchmark.to_json(),
155
+ "exit_code": self.exit_code,
156
+ "accepted": self.accepted,
157
+ "summary": self.summary,
158
+ "sandbox_mode": self.sandbox_mode,
159
+ }
160
+
161
+
162
+ def _repo_root() -> Path:
163
+ return Path(__file__).resolve().parents[1]
164
+
165
+
166
+ def _default_paths() -> SwarmPaths:
167
+ root = _repo_root()
168
+ state_root = root / "outputs" / "codex_swarm"
169
+ return SwarmPaths(
170
+ repo_root=root,
171
+ state_root=state_root,
172
+ worktree_root=Path("/tmp") / "0x960-codex-swarm",
173
+ champion_eval=state_root / "champion_eval.py",
174
+ champion_search=state_root / "champion_search.py",
175
+ ledger_path=state_root / "ledger.jsonl",
176
+ )
177
+
178
+
+ def _run(
+     args: list[str],
+     *,
+     cwd: Path,
+     capture_output: bool = True,
+     check: bool = True,
+     input_text: str | None = None,
+ ) -> subprocess.CompletedProcess[str]:
+     return subprocess.run(
+         args,
+         cwd=cwd,
+         input=input_text,
+         text=True,
+         capture_output=capture_output,
+         check=check,
+     )
+
+
+ def _git_output(repo_root: Path, args: list[str]) -> str:
+     result = _run(["git", *args], cwd=repo_root)
+     return result.stdout.strip()
+
+
+ def _ensure_state_dirs(paths: SwarmPaths) -> None:
+     paths.state_root.mkdir(parents=True, exist_ok=True)
+     (paths.state_root / "runs").mkdir(parents=True, exist_ok=True)
+     (paths.state_root / "accepted").mkdir(parents=True, exist_ok=True)
+
+
+ def _copy_file(src: Path, dst: Path) -> None:
+     dst.parent.mkdir(parents=True, exist_ok=True)
+     shutil.copy2(src, dst)
+
+
+ def _copy_tree(src: Path, dst: Path) -> None:
+     if dst.exists():
+         shutil.rmtree(dst, ignore_errors=True)
+     shutil.copytree(src, dst)
+
+
+ def _prepare_worker_dir(worker_dir: Path) -> Path:
+     if worker_dir.exists() and not any(worker_dir.iterdir()):
+         worker_dir.rmdir()
+     return worker_dir
+
+
+ def _infer_repo_mode(worker_dir: Path) -> str:
+     git_path = worker_dir / ".git"
+     if git_path.is_file():
+         return "worktree"
+     if git_path.is_dir():
+         return "clone"
+     return "unknown"
+
+
+ def _sync_worker_snapshot(paths: SwarmPaths, worker_dir: Path, sync_paths: tuple[str, ...]) -> None:
+     for rel_path in sync_paths:
+         src = paths.repo_root / rel_path
+         dst = worker_dir / rel_path
+         if src.is_file():
+             _copy_file(src, dst)
+     _copy_file(paths.champion_eval, worker_dir / "src" / "zero960" / "workspace_template" / "eval.py")
+     _copy_file(paths.champion_search, worker_dir / "src" / "zero960" / "engine" / "search.py")
+     accepted_src = paths.state_root / "accepted"
+     accepted_dst = worker_dir / "outputs" / "codex_swarm" / "accepted"
+     if accepted_src.exists():
+         _copy_tree(accepted_src, accepted_dst)
+     else:
+         accepted_dst.mkdir(parents=True, exist_ok=True)
+     _copy_file(paths.champion_eval, worker_dir / "outputs" / "codex_swarm" / "champion_eval.py")
+     _copy_file(paths.champion_search, worker_dir / "outputs" / "codex_swarm" / "champion_search.py")
+     ledger_copy = worker_dir / "outputs" / "codex_swarm" / "ledger.jsonl"
+     ledger_copy.parent.mkdir(parents=True, exist_ok=True)
+     if paths.ledger_path.exists():
+         shutil.copy2(paths.ledger_path, ledger_copy)
+     else:
+         ledger_copy.write_text("", encoding="utf-8")
+
+
+ def _setup_workers(paths: SwarmPaths, worker_count: int, sync_paths: tuple[str, ...]) -> list[tuple[Path, str]]:
+     worker_dirs: list[tuple[Path, str]] = []
+     paths.worktree_root.mkdir(parents=True, exist_ok=True)
+     for worker_index in range(1, worker_count + 1):
+         worker_dir = paths.worktree_root / f"worker-{worker_index}"
+         sandbox_mode = "existing"
+         if not (worker_dir / ".git").exists():
+             worker_dir = _prepare_worker_dir(worker_dir)
+             try:
+                 _run(
+                     ["git", "worktree", "add", "--detach", str(worker_dir), "HEAD"],
+                     cwd=paths.repo_root,
+                 )
+                 sandbox_mode = "worktree"
+             except subprocess.CalledProcessError:
+                 worker_dir = _prepare_worker_dir(worker_dir)
+                 _run(
+                     ["git", "clone", "--shared", str(paths.repo_root), str(worker_dir)],
+                     cwd=paths.repo_root,
+                 )
+                 sandbox_mode = "clone"
+         else:
+             sandbox_mode = _infer_repo_mode(worker_dir)
+         _sync_worker_snapshot(paths, worker_dir, sync_paths)
+         worker_dirs.append((worker_dir, sandbox_mode))
+     return worker_dirs
+
+
+ def _last_ledger_entries(paths: SwarmPaths, limit: int = 5) -> list[dict[str, object]]:
+     if not paths.ledger_path.exists():
+         return []
+     lines = [line for line in paths.ledger_path.read_text(encoding="utf-8").splitlines() if line.strip()]
+     entries = [json.loads(line) for line in lines[-limit:]]
+     return entries
+
+
+ def _extract_hook_body(champion_text: str, hook_name: str) -> str:
+     marker = f"def {hook_name}("
+     start = champion_text.find(marker)
+     if start == -1:
+         return ""
+     next_def = champion_text.find("\ndef ", start + len(marker))
+     if next_def == -1:
+         next_def = len(champion_text)
+     return champion_text[start:next_def]
+
+
+ def _hook_state_rank(champion_text: str, hook_name: str) -> int:
+     body = _extract_hook_body(champion_text, hook_name)
+     if not body:
+         return 99
+     terminal_lines = [line.strip() for line in body.splitlines() if line.strip()]
+     terminal_return = terminal_lines[-1] if terminal_lines else ""
+     if terminal_return == "return 0":
+         return 0
+     if terminal_return.startswith("return _base_"):
+         return 1
+     return 2
+
+
+ def _ordered_specializations(paths: SwarmPaths, surface: str) -> list[tuple[str, str, str]]:
+     if surface == "search":
+         return list(DEFAULT_SEARCH_SPECIALIZATIONS)
+     champion_text = paths.champion_eval.read_text(encoding="utf-8") if paths.champion_eval.exists() else ""
+     return sorted(
+         DEFAULT_WORKER_SPECIALIZATIONS,
+         key=lambda spec: (_hook_state_rank(champion_text, spec[2]), DEFAULT_WORKER_SPECIALIZATIONS.index(spec)),
+     )
+
+
+ def _build_worker_prompt(
+     *,
+     worker_name: str,
+     worker_role: str,
+     worker_lane: str,
+     target_hook: str,
+     target_file: str,
+     recent_entries: list[dict[str, object]],
+ ) -> str:
+     history_lines: list[str] = []
+     for entry in recent_entries:
+         if not entry.get("accepted"):
+             continue
+         benchmark = entry.get("benchmark") or {}
+         history_lines.append(
+             f"- {entry['worker_name']}: score={benchmark.get('score')} "
+             f"elo={benchmark.get('elo_delta_estimate')} summary={entry.get('summary')}"
+         )
+     recent_history = "\n".join(history_lines) if history_lines else "- no accepted candidates yet"
+     return f"""Improve the current Chess960 champion in `{target_file}`.
+
+ Lane:
+ - {worker_lane}
+
+ Target hook:
+ - edit only `{target_hook}` and keep the rest of the file unchanged
+ - if you need helper values, define them directly inside `{target_hook}` or make the smallest possible local constant change
+
+ Before editing, inspect:
+ - `{target_file}`
+ - `outputs/codex_swarm/champion_eval.py`
+ - `outputs/codex_swarm/champion_search.py`
+ - `outputs/codex_swarm/ledger.jsonl`
+ - `outputs/codex_swarm/accepted/`
+
+ Requirements:
+ - edit only `{target_file}`
+ - make one small surgical patch inside `{target_hook}`
+ - avoid duplicating prior accepted winners
+ - do not run held-out benchmarks; the coordinator does that
+ - finish quickly with a short summary of the patch and why it should help
+
+ Recent accepted candidates:
+ {recent_history}
+ """
+
+
+ def _build_worker_agents_override(
+     *,
+     worker_name: str,
+     worker_role: str,
+     worker_lane: str,
+     target_hook: str,
+     target_file: str,
+     max_diff_lines: int,
+ ) -> str:
+     return f"""# Codex swarm worker override
+
+ You are {worker_name}, the {worker_role}, in the 0x960 Codex swarm.
+
+ Primary lane:
+ - {worker_lane}
+
+ Hard requirements:
+ - Edit only `{target_file}`.
+ - Touch only the `{target_hook}` function body unless a tiny adjacent constant change is absolutely necessary.
+ - Use `apply_patch` or similarly surgical edits. Do not rewrite the whole file.
+ - Keep the final diff within about {max_diff_lines} changed lines total.
+ - If your current diff exceeds that budget, revert the excess and reduce the patch before finishing.
+ - Run at most one small local probe. Do not run held-out benchmarks yourself; the coordinator handles them.
+ - Do not browse the web or use internet-dependent tools for this task.
+ - Stop immediately after the patch and one tiny sanity check. Do not spend time on extra diffs, `rg`, or `sed` inspections once the patch is in place.
+ - Write a concise summary of the change and why it should help, then exit.
+ """
+
+
+ def _surface_config(surface: str) -> tuple[tuple[str, ...], str]:
+     if surface == "search":
+         return DEFAULT_SEARCH_EDITABLE_FILES, DEFAULT_SEARCH_EDITABLE_FILES[0]
+     return DEFAULT_EVAL_EDITABLE_FILES, DEFAULT_EVAL_EDITABLE_FILES[0]
+
+
+ def _screen_settings(args: argparse.Namespace) -> tuple[int, int, int]:
+     if args.surface == "search":
+         return args.search_screen_positions, args.search_screen_depth, args.search_screen_max_plies
+     return args.screen_positions, args.depth, args.max_plies
+
+
+ def _baseline_snapshot_root(paths: SwarmPaths, round_dir: Path) -> Path:
+     baseline_root = round_dir / "baseline_root"
+     _copy_file(
+         paths.champion_eval,
+         baseline_root / "src" / "zero960" / "workspace_template" / "eval.py",
+     )
+     _copy_file(
+         paths.champion_search,
+         baseline_root / "src" / "zero960" / "engine" / "search.py",
+     )
+     return baseline_root
+
+
+ def _snapshot_files(worker_dir: Path, rel_paths: tuple[str, ...]) -> dict[str, str]:
+     snapshot: dict[str, str] = {}
+     for rel_path in rel_paths:
+         file_path = worker_dir / rel_path
+         if file_path.exists():
+             snapshot[rel_path] = file_path.read_text(encoding="utf-8")
+     return snapshot
+
+
+ def _changed_snapshot_paths(before: dict[str, str], worker_dir: Path, rel_paths: tuple[str, ...]) -> list[str]:
+     changed: list[str] = []
+     for rel_path in rel_paths:
+         file_path = worker_dir / rel_path
+         after = file_path.read_text(encoding="utf-8") if file_path.exists() else ""
+         if before.get(rel_path, "") != after:
+             changed.append(rel_path)
+     return changed
+
+
+ def _snapshot_diff_line_counts(
+     before: dict[str, str],
+     worker_dir: Path,
+     rel_paths: tuple[str, ...],
+ ) -> tuple[int, int]:
+     added = 0
+     deleted = 0
+     for rel_path in rel_paths:
+         before_text = before.get(rel_path, "")
+         file_path = worker_dir / rel_path
+         after_text = file_path.read_text(encoding="utf-8") if file_path.exists() else ""
+         for line in difflib.unified_diff(
+             before_text.splitlines(),
+             after_text.splitlines(),
+             fromfile=rel_path,
+             tofile=rel_path,
+             lineterm="",
+         ):
+             if line.startswith(("---", "+++", "@@")):
+                 continue
+             if line.startswith("+"):
+                 added += 1
+             elif line.startswith("-"):
+                 deleted += 1
+     return added, deleted
+
+
+ def _write_json(path: Path, payload: dict[str, object]) -> None:
+     path.parent.mkdir(parents=True, exist_ok=True)
+     path.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n", encoding="utf-8")
+
+
+ def _append_jsonl(path: Path, payload: dict[str, object]) -> None:
+     path.parent.mkdir(parents=True, exist_ok=True)
+     with path.open("a", encoding="utf-8") as handle:
+         handle.write(json.dumps(payload, sort_keys=True) + "\n")
+
+
+ def _decode_timeout_output(payload: str | bytes | None) -> str:
+     if payload is None:
+         return ""
+     if isinstance(payload, bytes):
+         return payload.decode("utf-8", errors="replace")
+     return payload
+
+
+ def _run_worker(
+     *,
+     paths: SwarmPaths,
+     worker_dir: Path,
+     round_dir: Path,
+     worker_name: str,
+     worker_role: str,
+     worker_lane: str,
+     target_hook: str,
+     target_file: str,
+     model: str,
+     editable_files: tuple[str, ...],
+     candidate_file_rel: str,
+     positions: int,
+     depth: int,
+     max_plies: int,
+     seed: int,
+     min_score: float,
+     max_diff_lines: int,
+     worker_timeout_sec: int,
+     dry_run: bool,
+     sandbox_mode: str,
+ ) -> WorkerResult:
+     worker_dir = worker_dir.resolve()
+     candidate_file = worker_dir / candidate_file_rel
+     prompt_path = round_dir / f"{worker_name}_prompt.txt"
+     final_message_path = round_dir / f"{worker_name}_final.txt"
+     stdout_path = round_dir / f"{worker_name}_stdout.log"
+     stderr_path = round_dir / f"{worker_name}_stderr.log"
+     recent_entries = _last_ledger_entries(paths)
+     prompt = _build_worker_prompt(
+         worker_name=worker_name,
+         worker_role=worker_role,
+         worker_lane=worker_lane,
+         target_hook=target_hook,
+         target_file=target_file,
+         recent_entries=recent_entries,
+     )
+     prompt_path.write_text(prompt, encoding="utf-8")
+     (worker_dir / "AGENTS.override.md").write_text(
+         _build_worker_agents_override(
+             worker_name=worker_name,
+             worker_role=worker_role,
+             worker_lane=worker_lane,
+             target_hook=target_hook,
+             target_file=target_file,
+             max_diff_lines=max_diff_lines,
+         ),
+         encoding="utf-8",
+     )
+     before_snapshot = _snapshot_files(worker_dir, editable_files)
+
+     if dry_run:
+         stdout_path.write_text("dry-run\n", encoding="utf-8")
+         stderr_path.write_text("", encoding="utf-8")
+         final_message_path.write_text("dry-run\n", encoding="utf-8")
+         changed_files = _changed_snapshot_paths(before_snapshot, worker_dir, editable_files)
+         return WorkerResult(
+             worker_name=worker_name,
+             worktree_dir=worker_dir,
+             round_dir=round_dir,
+             prompt_path=prompt_path,
+             final_message_path=final_message_path,
+             stdout_path=stdout_path,
+             stderr_path=stderr_path,
+             candidate_file=candidate_file,
+             changed_files=changed_files,
+             diff_lines_added=0,
+             diff_lines_deleted=0,
+             screen_benchmark=None,
+             benchmark=None,
+             exit_code=0,
+             accepted=False,
+             summary="dry-run",
+             sandbox_mode=sandbox_mode,
+         )
+
+     try:
+         completed = subprocess.run(
+             [
+                 "codex",
+                 "exec",
+                 "-m",
+                 model,
+                 "--full-auto",
+                 "--ephemeral",
+                 "--json",
+                 "-c",
+                 'web_search="disabled"',
+                 "--color",
+                 "never",
+                 "--output-last-message",
+                 str(final_message_path),
+                 "-",
+             ],
+             cwd=worker_dir,
+             input=prompt,
+             text=True,
+             capture_output=True,
+             check=False,
+             timeout=worker_timeout_sec,
+         )
+         stdout_path.write_text(completed.stdout, encoding="utf-8")
+         stderr_path.write_text(completed.stderr, encoding="utf-8")
+     except subprocess.TimeoutExpired as exc:
+         stdout_text = _decode_timeout_output(exc.stdout)
+         stderr_text = _decode_timeout_output(exc.stderr)
+         stdout_path.write_text(stdout_text, encoding="utf-8")
+         stderr_path.write_text(stderr_text + f"\nTimed out after {worker_timeout_sec} seconds.\n", encoding="utf-8")
+         if not final_message_path.exists():
+             final_message_path.write_text("", encoding="utf-8")
+         changed_files = _changed_snapshot_paths(before_snapshot, worker_dir, editable_files)
+         diff_lines_added, diff_lines_deleted = _snapshot_diff_line_counts(before_snapshot, worker_dir, editable_files)
+         return WorkerResult(
+             worker_name=worker_name,
+             worktree_dir=worker_dir,
+             round_dir=round_dir,
+             prompt_path=prompt_path,
+             final_message_path=final_message_path,
+             stdout_path=stdout_path,
+             stderr_path=stderr_path,
+             candidate_file=candidate_file,
+             changed_files=changed_files,
+             diff_lines_added=diff_lines_added,
+             diff_lines_deleted=diff_lines_deleted,
+             screen_benchmark=None,
+             benchmark=None,
+             exit_code=None,
+             accepted=False,
+             summary=f"timed out after {worker_timeout_sec}s",
+             sandbox_mode=sandbox_mode,
+         )
+     if not final_message_path.exists():
+         final_message_path.write_text("", encoding="utf-8")
+
+     changed_files = _changed_snapshot_paths(before_snapshot, worker_dir, editable_files)
+     diff_lines_added, diff_lines_deleted = _snapshot_diff_line_counts(before_snapshot, worker_dir, editable_files)
+     summary = final_message_path.read_text(encoding="utf-8").strip()
+     if completed.returncode != 0:
+         summary = summary or "codex exec failed"
+
+     return WorkerResult(
+         worker_name=worker_name,
+         worktree_dir=worker_dir,
+         round_dir=round_dir,
+         prompt_path=prompt_path,
+         final_message_path=final_message_path,
+         stdout_path=stdout_path,
+         stderr_path=stderr_path,
+         candidate_file=candidate_file,
+         changed_files=changed_files,
+         diff_lines_added=diff_lines_added,
+         diff_lines_deleted=diff_lines_deleted,
+         screen_benchmark=None,
+         benchmark=None,
+         exit_code=completed.returncode,
+         accepted=False,
+         summary=summary,
+         sandbox_mode=sandbox_mode,
+     )
+
+
+ def _eligible_for_screen(result: WorkerResult, max_diff_lines: int) -> bool:
+     if result.exit_code not in (0, None):
+         return False
+     if not result.candidate_file.exists():
+         return False
+     if not result.changed_files:
+         return False
+     return (result.diff_lines_added + result.diff_lines_deleted) <= max_diff_lines
+
+
+ def _candidate_compiles(candidate_file: Path) -> bool:
+     completed = subprocess.run(
+         ["python3", "-m", "py_compile", str(candidate_file)],
+         cwd=_repo_root(),
+         capture_output=True,
+         text=True,
+         check=False,
+     )
+     return completed.returncode == 0
+
+
+ def _benchmark_eval_task(
+     candidate_file: str,
+     baseline_file: str,
+     positions: int,
+     depth: int,
+     max_plies: int,
+     seed: int,
+ ) -> BenchmarkResult:
+     return benchmark_eval_files(
+         Path(candidate_file).resolve(),
+         Path(baseline_file).resolve(),
+         positions=positions,
+         depth=depth,
+         max_plies=max_plies,
+         seed=seed,
+     )
+
+
+ def _benchmark_engine_task(
+     candidate_root: str,
+     baseline_root: str,
+     positions: int,
+     depth: int,
+     max_plies: int,
+     seed: int,
+ ) -> BenchmarkResult:
+     return benchmark_engine_roots(
+         Path(candidate_root).resolve(),
+         Path(baseline_root).resolve(),
+         positions=positions,
+         depth=depth,
+         max_plies=max_plies,
+         seed=seed,
+     )
+
+
+ def _run_benchmark_with_timeout(
+     *,
+     surface: str,
+     candidate_path: Path,
+     baseline_path: Path,
+     positions: int,
+     depth: int,
+     max_plies: int,
+     seed: int,
+     timeout_sec: int,
+ ) -> BenchmarkResult | None:
+     task = _benchmark_engine_task if surface == "search" else _benchmark_eval_task
+     with ProcessPoolExecutor(max_workers=1) as executor:
+         future = executor.submit(
+             task,
+             str(candidate_path),
+             str(baseline_path),
+             positions,
+             depth,
+             max_plies,
+             seed,
+         )
+         try:
+             return future.result(timeout=timeout_sec)
+         except TimeoutError:
+             future.cancel()
+             return None
+
+
+ def _best_screened(results: list[WorkerResult], screen_min_score: float, surface: str) -> WorkerResult | None:
+     comparator = (
+         (lambda score: score >= screen_min_score)
+         if surface == "search"
+         else (lambda score: score > screen_min_score)
+     )
+     screened = [
+         result
+         for result in results
+         if result.screen_benchmark is not None and comparator(result.screen_benchmark.score)
+     ]
+     if not screened:
+         return None
+     return max(screened, key=lambda result: result.screen_benchmark.score)
+
+
+ def _promote_winner(paths: SwarmPaths, winner: WorkerResult, promote_source: bool) -> None:
+     accepted_dir = paths.state_root / "accepted"
+     timestamp = datetime.now(UTC).strftime("%Y%m%dT%H%M%SZ")
+     if "src/zero960/workspace_template/eval.py" in winner.changed_files:
+         _copy_file(winner.worktree_dir / "src/zero960/workspace_template/eval.py", paths.champion_eval)
+         _copy_file(
+             winner.worktree_dir / "src/zero960/workspace_template/eval.py",
+             accepted_dir / f"{timestamp}_{winner.worker_name}_eval.py",
+         )
+     if "src/zero960/engine/search.py" in winner.changed_files:
+         _copy_file(winner.worktree_dir / "src/zero960/engine/search.py", paths.champion_search)
+         _copy_file(
+             winner.worktree_dir / "src/zero960/engine/search.py",
+             accepted_dir / f"{timestamp}_{winner.worker_name}_search.py",
+         )
+     if promote_source and "src/zero960/workspace_template/eval.py" in winner.changed_files:
+         _copy_file(winner.candidate_file, paths.repo_root / "src/zero960/workspace_template/eval.py")
+         _copy_file(winner.candidate_file, paths.repo_root / "src/zero960/engine/default_eval.py")
+     if promote_source and "src/zero960/engine/search.py" in winner.changed_files:
+         _copy_file(winner.worktree_dir / "src/zero960/engine/search.py", paths.repo_root / "src/zero960/engine/search.py")
+
+
+ def _state_summary(paths: SwarmPaths) -> str:
+     entries = [
+         entry
+         for entry in _last_ledger_entries(paths, limit=20)
+         if (entry.get("benchmark") or {}).get("score") is not None
+     ]
+     if not paths.champion_eval.exists():
+         return "no champion yet"
+     if not entries:
+         return f"champion={paths.champion_eval}"
+     last = entries[-1]
+     benchmark = last.get("benchmark") or {}
+     return (
+         f"champion={paths.champion_eval} "
+         f"last_worker={last.get('worker_name')} "
+         f"score={benchmark.get('score')} "
+         f"elo={benchmark.get('elo_delta_estimate')}"
+     )
+
+
+ def parse_args() -> argparse.Namespace:
+     paths = _default_paths()
+     parser = argparse.ArgumentParser(description=__doc__)
+     subparsers = parser.add_subparsers(dest="command", required=True)
+
+     setup = subparsers.add_parser("setup", help="Create local worker worktrees and initialize the champion.")
+     setup.add_argument("--workers", type=int, default=DEFAULT_WORKER_COUNT)
+     setup.add_argument("--worktree-root", default=str(paths.worktree_root))
+     setup.add_argument("--reset-champion", action="store_true")
+
+     run = subparsers.add_parser("run", help="Run one or more champion/challenger rounds.")
+     run.add_argument("--workers", type=int, default=DEFAULT_WORKER_COUNT)
+     run.add_argument("--rounds", type=int, default=1)
+     run.add_argument("--model", default=DEFAULT_MODEL)
+     run.add_argument("--surface", choices=("eval", "search"), default=DEFAULT_SURFACE)
+     run.add_argument("--worktree-root", default=str(paths.worktree_root))
+     run.add_argument("--screen-positions", type=int, default=DEFAULT_SCREEN_POSITIONS)
+     run.add_argument("--positions", type=int, default=DEFAULT_POSITIONS)
+     run.add_argument("--depth", type=int, default=DEFAULT_DEPTH)
+     run.add_argument("--max-plies", type=int, default=DEFAULT_MAX_PLIES)
+     run.add_argument("--search-screen-positions", type=int, default=DEFAULT_SEARCH_SCREEN_POSITIONS)
+     run.add_argument("--search-screen-depth", type=int, default=DEFAULT_SEARCH_SCREEN_DEPTH)
+     run.add_argument("--search-screen-max-plies", type=int, default=DEFAULT_SEARCH_SCREEN_MAX_PLIES)
+     run.add_argument("--seed", type=int, default=42)
+     run.add_argument("--screen-min-score", type=float, default=DEFAULT_SCREEN_MIN_SCORE)
+     run.add_argument("--min-score", type=float, default=DEFAULT_MIN_SCORE)
+     run.add_argument("--max-diff-lines", type=int, default=DEFAULT_MAX_DIFF_LINES)
+     run.add_argument("--worker-timeout-sec", type=int, default=DEFAULT_WORKER_TIMEOUT_SEC)
+     run.add_argument("--benchmark-timeout-sec", type=int, default=DEFAULT_BENCHMARK_TIMEOUT_SEC)
+     run.add_argument("--final-benchmark-timeout-sec", type=int, default=DEFAULT_FINAL_BENCHMARK_TIMEOUT_SEC)
+     run.add_argument("--dry-run", action="store_true")
+     run.add_argument("--serial", action="store_true", help="Run workers sequentially instead of in parallel.")
+     run.add_argument("--promote-source", action="store_true")
+     run.add_argument("--continuous", action="store_true", help="Keep running rounds until interrupted or stalled.")
+     run.add_argument(
+         "--max-stall-rounds",
+         type=int,
+         default=3,
+         help="Stop continuous mode after this many consecutive non-promotion rounds. Use 0 to disable.",
+     )
+     run.add_argument("--sleep-sec", type=float, default=0.0, help="Sleep between continuous rounds.")
+
+     status = subparsers.add_parser("status", help="Print the current champion summary and recent results.")
+
+     promote = subparsers.add_parser("promote", help="Copy the current swarm champion into the source tree.")
+     promote.add_argument("--source-only", action="store_true", help="Skip copying to default_eval.py.")
+
+     return parser.parse_args()
+
+
+ def _resolve_paths(args: argparse.Namespace) -> SwarmPaths:
+     paths = _default_paths()
+     if hasattr(args, "worktree_root"):
+         paths.worktree_root = Path(args.worktree_root).resolve()
+     return paths
+
+
+ def _setup_command(args: argparse.Namespace) -> int:
+     paths = _resolve_paths(args)
+     _ensure_state_dirs(paths)
+     if args.reset_champion or not paths.champion_eval.exists():
+         _copy_file(paths.repo_root / "src/zero960/workspace_template/eval.py", paths.champion_eval)
+     if args.reset_champion or not paths.champion_search.exists():
+         _copy_file(paths.repo_root / "src/zero960/engine/search.py", paths.champion_search)
+     worker_dirs = _setup_workers(paths, args.workers, DEFAULT_SYNC_PATHS)
+     print(f"initialized champion: {paths.champion_eval}")
+     for worker_dir, sandbox_mode in worker_dirs:
+         print(f"worker: {worker_dir} mode={sandbox_mode}")
+     return 0
+
+
+ def _run_command(args: argparse.Namespace) -> int:
+     paths = _resolve_paths(args)
+     _ensure_state_dirs(paths)
+     if not paths.champion_eval.exists():
+         _copy_file(paths.repo_root / "src/zero960/workspace_template/eval.py", paths.champion_eval)
+     if not paths.champion_search.exists():
+         _copy_file(paths.repo_root / "src/zero960/engine/search.py", paths.champion_search)
+     worker_dirs = _setup_workers(paths, args.workers, DEFAULT_SYNC_PATHS)
+     editable_files, candidate_file_rel = _surface_config(args.surface)
+
+     round_index = 0
+     stall_rounds = 0
+     target_rounds = None if args.continuous else args.rounds
+
+     while target_rounds is None or round_index < target_rounds:
+         round_index += 1
+         round_seed = args.seed + round_index - 1
+         round_timestamp = datetime.now(UTC).strftime("%Y%m%dT%H%M%SZ")
+         round_dir = paths.state_root / "runs" / f"round_{round_timestamp}_{round_index}"
+         round_dir.mkdir(parents=True, exist_ok=True)
+         baseline_root = _baseline_snapshot_root(paths, round_dir)
+         round_specializations = _ordered_specializations(paths, args.surface)
+         screen_positions, screen_depth, screen_max_plies = _screen_settings(args)
+         print(f"round {round_index}: champion frozen at {paths.champion_eval}")
+         print(f"round {round_index}: surface={args.surface}")
+         print(
+             "round hooks: "
+             + ", ".join(spec[2] for spec in round_specializations[: len(worker_dirs)])
+         )
+
+         jobs = []
+         if args.serial or args.dry_run:
+             for worker_index, (worker_dir, sandbox_mode) in enumerate(worker_dirs, start=1):
+                 _sync_worker_snapshot(paths, worker_dir, DEFAULT_SYNC_PATHS)
+                 worker_role, worker_lane, target_hook = round_specializations[(worker_index - 1) % len(round_specializations)]
+                 result = _run_worker(
+                     paths=paths,
+                     worker_dir=worker_dir,
+                     round_dir=round_dir,
+                     worker_name=f"worker-{worker_index}",
+                     worker_role=worker_role,
+                     worker_lane=worker_lane,
+                     target_hook=target_hook,
+                     target_file=candidate_file_rel,
+                     model=args.model,
+                     editable_files=editable_files,
+                     candidate_file_rel=candidate_file_rel,
+                     positions=args.positions,
+                     depth=args.depth,
+                     max_plies=args.max_plies,
+                     seed=round_seed,
+                     min_score=args.min_score,
+                     max_diff_lines=args.max_diff_lines,
+                     worker_timeout_sec=args.worker_timeout_sec,
+                     dry_run=args.dry_run,
+                     sandbox_mode=sandbox_mode,
+                 )
+                 jobs.append(result)
+         else:
+             with ThreadPoolExecutor(max_workers=len(worker_dirs)) as executor:
+                 futures = []
+                 for worker_index, (worker_dir, sandbox_mode) in enumerate(worker_dirs, start=1):
+                     _sync_worker_snapshot(paths, worker_dir, DEFAULT_SYNC_PATHS)
+                     worker_role, worker_lane, target_hook = round_specializations[(worker_index - 1) % len(round_specializations)]
+                     futures.append(
+                         executor.submit(
+                             _run_worker,
+                             paths=paths,
+                             worker_dir=worker_dir,
+                             round_dir=round_dir,
+                             worker_name=f"worker-{worker_index}",
+                             worker_role=worker_role,
+                             worker_lane=worker_lane,
+                             target_hook=target_hook,
+                             target_file=candidate_file_rel,
+                             model=args.model,
+                             editable_files=editable_files,
+                             candidate_file_rel=candidate_file_rel,
+                             positions=args.positions,
+                             depth=args.depth,
+                             max_plies=args.max_plies,
+                             seed=round_seed,
+                             min_score=args.min_score,
+                             max_diff_lines=args.max_diff_lines,
+                             worker_timeout_sec=args.worker_timeout_sec,
+                             dry_run=args.dry_run,
+                             sandbox_mode=sandbox_mode,
+                         )
+                     )
+                 for future in as_completed(futures):
+                     jobs.append(future.result())
+
+         jobs.sort(key=lambda result: result.worker_name)
+         for result in jobs:
+             diff_total = result.diff_lines_added + result.diff_lines_deleted
+             if result.exit_code not in (0, None):
+                 continue
+             if not result.candidate_file.exists():
+                 continue
+             if not result.changed_files:
+                 rejection = "rejected before benchmark: no file changes"
+                 result.summary = f"{result.summary}\n{rejection}".strip() if result.summary else rejection
+                 continue
+             if diff_total > args.max_diff_lines:
+                 overflow = diff_total - args.max_diff_lines
+                 rejection = f"rejected before benchmark: diff budget exceeded by {overflow} lines"
+                 result.summary = f"{result.summary}\n{rejection}".strip() if result.summary else rejection
+                 continue
+             if not _candidate_compiles(result.candidate_file):
+                 rejection = "rejected before benchmark: candidate failed py_compile"
+                 result.summary = f"{result.summary}\n{rejection}".strip() if result.summary else rejection
+                 continue
+             screen_candidate = result.worktree_dir.resolve() if args.surface == "search" else result.candidate_file.resolve()
+             screen_baseline = (
+                 baseline_root.resolve()
+                 if args.surface == "search"
+                 else (baseline_root / "src" / "zero960" / "workspace_template" / "eval.py").resolve()
+             )
+             result.screen_benchmark = _run_benchmark_with_timeout(
+                 surface=args.surface,
+                 candidate_path=screen_candidate,
+                 baseline_path=screen_baseline,
+                 positions=screen_positions,
+                 depth=screen_depth,
+                 max_plies=screen_max_plies,
+                 seed=round_seed,
+                 timeout_sec=args.benchmark_timeout_sec,
+             )
+             if result.screen_benchmark is None:
+                 rejection = f"rejected during screen benchmark: timed out after {args.benchmark_timeout_sec}s"
+                 result.summary = f"{result.summary}\n{rejection}".strip() if result.summary else rejection
+
+         winner = _best_screened(jobs, args.screen_min_score, args.surface)
+         if winner is not None:
+             final_candidate = winner.worktree_dir.resolve() if args.surface == "search" else winner.candidate_file.resolve()
+             final_baseline = (
+                 baseline_root.resolve()
+                 if args.surface == "search"
+                 else (baseline_root / "src" / "zero960" / "workspace_template" / "eval.py").resolve()
+             )
+             winner.benchmark = _run_benchmark_with_timeout(
+                 surface=args.surface,
+                 candidate_path=final_candidate,
+                 baseline_path=final_baseline,
+                 positions=args.positions,
+                 depth=args.depth,
+                 max_plies=args.max_plies,
+                 seed=round_seed,
+                 timeout_sec=args.final_benchmark_timeout_sec,
+             )
+             winner.accepted = winner.benchmark is not None and winner.benchmark.score > args.min_score
+             if winner.benchmark is None:
+                 rejection = f"screen winner timed out in final benchmark after {args.final_benchmark_timeout_sec}s"
+                 winner.summary = f"{winner.summary}\n{rejection}".strip() if winner.summary else rejection
+                 winner = None
+             elif not winner.accepted:
+                 rejection = (
+                     f"screen winner failed final benchmark: "
+                     f"{winner.benchmark.score:.3f} <= {args.min_score:.3f}"
+                 )
+                 winner.summary = f"{winner.summary}\n{rejection}".strip() if winner.summary else rejection
+                 winner = None
+
+         for result in jobs:
+             payload = result.to_json()
+             payload["round_index"] = round_index
+             payload["winner"] = bool(winner and winner.worker_name == result.worker_name)
+             payload["surface"] = args.surface
+             _write_json(round_dir / f"{result.worker_name}_result.json", payload)
+             if not args.dry_run:
+                 _append_jsonl(paths.ledger_path, payload)
+             screen_text = "n/a" if result.screen_benchmark is None else f"{result.screen_benchmark.score:.3f}"
+             final_text = "n/a" if result.benchmark is None else f"{result.benchmark.score:.3f}"
+             print(
+                 f"{result.worker_name}: exit={result.exit_code} "
+                 f"screen={screen_text} final={final_text} accepted={result.accepted} changed={len(result.changed_files)} "
+                 f"diff=+{result.diff_lines_added}/-{result.diff_lines_deleted} mode={result.sandbox_mode}"
+             )
+
+         if winner is None:
+             print(f"round {round_index}: no challenger beat the champion")
+             stall_rounds += 1
+             if args.continuous and args.max_stall_rounds and stall_rounds >= args.max_stall_rounds:
+                 print(f"stopping after {stall_rounds} consecutive non-promotion rounds")
+                 break
+             if args.continuous and args.sleep_sec > 0:
+                 time.sleep(args.sleep_sec)
+             continue
+
+         _promote_winner(paths, winner, args.promote_source)
+         stall_rounds = 0
+         print(
+             f"round {round_index}: promoted {winner.worker_name} "
1064
+ f"score={winner.benchmark.score:.3f} elo={winner.benchmark.elo_delta_estimate:.1f}"
1065
+ )
1066
+ if args.continuous and args.sleep_sec > 0:
1067
+ time.sleep(args.sleep_sec)
1068
+
1069
+ print(_state_summary(paths))
1070
+ return 0
1071
+
1072
+
1073
+ def _status_command() -> int:
1074
+ paths = _default_paths()
1075
+ print(_state_summary(paths))
1076
+ for entry in _last_ledger_entries(paths):
1077
+ benchmark = entry.get("benchmark") or {}
1078
+ if benchmark.get("score") is None:
1079
+ continue
1080
+ print(
1081
+ f"{entry.get('worker_name')}: accepted={entry.get('accepted')} "
1082
+ f"score={benchmark.get('score')} elo={benchmark.get('elo_delta_estimate')}"
1083
+ )
1084
+ return 0
1085
+
1086
+
1087
+ def _promote_command(args: argparse.Namespace) -> int:
1088
+ paths = _default_paths()
1089
+ if not paths.champion_eval.exists():
1090
+ raise SystemExit("no champion available; run setup or run first")
1091
+ _copy_file(paths.champion_eval, paths.repo_root / "src/zero960/workspace_template/eval.py")
1092
+ if paths.champion_search.exists():
1093
+ _copy_file(paths.champion_search, paths.repo_root / "src/zero960/engine/search.py")
1094
+ if not args.source_only:
1095
+ _copy_file(paths.champion_eval, paths.repo_root / "src/zero960/engine/default_eval.py")
1096
+ print(f"promoted champion from {paths.champion_eval}")
1097
+ return 0
1098
+
1099
+
1100
+ def main() -> None:
1101
+ args = parse_args()
1102
+ if args.command == "setup":
1103
+ raise SystemExit(_setup_command(args))
1104
+ if args.command == "run":
1105
+ raise SystemExit(_run_command(args))
1106
+ if args.command == "status":
1107
+ raise SystemExit(_status_command())
1108
+ if args.command == "promote":
1109
+ raise SystemExit(_promote_command(args))
1110
+ raise SystemExit(f"unknown command: {args.command}")
1111
+
1112
+
1113
+ if __name__ == "__main__":
1114
+ main()
train/minimal_trl_openenv.py CHANGED
@@ -9,6 +9,7 @@ Modes:
 from __future__ import annotations
 
 import argparse
 import json
 import os
 import re
@@ -19,17 +20,78 @@ from pathlib import Path
 from zero960_env.client import Zero960Client
 from zero960_env.models import Zero960Action
 
 SYSTEM_PROMPT = (
     "You are a Chess960 evaluation engineer. You can take ONE action per turn.\n"
-    "Actions (respond with valid JSON only, no other text):\n"
     '  {"action_type":"read_file","path":"eval.py"}\n'
-    '  {"action_type":"write_file","path":"eval.py","content":"<new code>"}\n'
     '  {"action_type":"run_static_eval"}\n'
     '  {"action_type":"run_match"}\n'
     '  {"action_type":"finish"}\n'
     "\n"
-    "Goal: improve eval.py so the Chess960 engine beats the baseline.\n"
-    "Strategy: read eval.py edit it run_match to test finish when satisfied."
 )
@@ -44,9 +106,14 @@ def format_observation_as_prompt(obs, system_prompt: str = SYSTEM_PROMPT) -> str
         f"Position index: {obs.start_position}\n"
         f"Steps remaining: {obs.remaining_steps}\n"
         f"Last match score: {obs.last_match_score}\n"
         f"History: {obs.history}\n\n"
         f"Current eval.py:\n```python\n{eval_code}\n```\n\n"
-        "Choose your next action (JSON only)."
     )
     return user_msg
@@ -59,24 +126,380 @@ def format_messages(obs) -> list[dict[str, str]]:
     ]
 
 
 def parse_llm_output(text: str) -> Zero960Action:
     """Best-effort parse of LLM output into a Zero960Action."""
-    # Try to find JSON with nested braces (for write_file with content)
-    json_match = re.search(r'\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\}', text, re.DOTALL)
-    if json_match:
-        try:
-            data = json.loads(json_match.group())
-            return Zero960Action(**data)
-        except (json.JSONDecodeError, ValueError):
-            pass
-    # Simpler JSON match
-    json_match = re.search(r'\{[^}]+\}', text, re.DOTALL)
-    if json_match:
         try:
-            data = json.loads(json_match.group())
             return Zero960Action(**data)
         except (json.JSONDecodeError, ValueError):
-            pass
     return Zero960Action(action_type="finish")
@@ -92,18 +515,22 @@ class RolloutSummary:
 
 
 def run_handcrafted_rollout(base_url: str) -> RolloutSummary:
-    """Quick demo: connect, read eval, run match, finish."""
     with Zero960Client(base_url=base_url) as client:
         result = client.reset()
         obs = result.observation
 
-        result = client.step(Zero960Action(action_type="read_file", path="eval.py"))
         obs = result.observation
 
-        if obs.remaining_steps > 1:
-            result = client.step(Zero960Action(action_type="run_static_eval"))
-            obs = result.observation
-
         if obs.remaining_steps > 1:
             result = client.step(Zero960Action(action_type="run_match"))
             obs = result.observation
@@ -125,13 +552,15 @@ def run_handcrafted_rollout(base_url: str) -> RolloutSummary:
 def run_inference_test(
     base_url: str,
     model_name: str = "Qwen/Qwen3.5-9B",
     max_episode_steps: int = 6,
 ) -> RolloutSummary:
     """Run a single episode with Qwen generating actions against the live env."""
     from transformers import AutoModelForCausalLM, AutoTokenizer
 
     print(f"Loading {model_name}...")
-    tokenizer = AutoTokenizer.from_pretrained(model_name)
     model = AutoModelForCausalLM.from_pretrained(
         model_name, torch_dtype="auto", device_map="auto",
     )
@@ -144,32 +573,28 @@ def run_inference_test(
         if result.done:
             break
 
-        msgs = format_messages(obs)
-        prompt = tokenizer.apply_chat_template(
-            msgs, tokenize=False, add_generation_prompt=True,
-        )
-        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
-        outputs = model.generate(
-            **inputs, max_new_tokens=1024,
-            temperature=0.7, top_p=0.9, do_sample=True,
         )
-        generated = tokenizer.decode(
-            outputs[0][inputs["input_ids"].shape[1]:],
-            skip_special_tokens=True,
         )
-
-        print(f"\n--- Step {step_i + 1} ---")
-        print(f"LLM output: {generated[:300]}...")
-
-        action = parse_llm_output(generated)
         print(f"Parsed action: {action.action_type}", end="")
         if action.path:
             print(f" path={action.path}", end="")
         print()
 
         result = client.step(action)
         obs = result.observation
         print(f"Status: {obs.status_message}")
 
         if not result.done:
             result = client.step(Zero960Action(action_type="finish"))
@@ -217,10 +642,8 @@ def run_grpo_training(
     prompts = []
     for _ in range(n):
         result = env.reset()
-        msgs = format_messages(result.observation)
-        prompt_text = tokenizer.apply_chat_template(
-            msgs, tokenize=False, add_generation_prompt=True,
-        )
         prompts.append(prompt_text)
     return Dataset.from_dict({"prompt": prompts})
@@ -250,36 +673,43 @@ def run_grpo_training(
             "gen": i,
             "completion_preview": completion[:500],
             "completion_len": len(completion),
         }
         try:
             result = env.reset()
             obs = result.observation
 
-            # First action from the model's completion
-            action = parse_llm_output(completion)
             entry["parsed_action"] = action.action_type
             entry["parsed_path"] = action.path
             if action.action_type == "write_file" and action.content:
                 entry["code_preview"] = action.content[:300]
                 entry["code_len"] = len(action.content)
 
             result = env.step(action)
             obs = result.observation
             entry["env_status_1"] = obs.status_message
 
             # If the model wrote code, run a match to get a real score
             if not result.done and action.action_type == "write_file":
                 result = env.step(Zero960Action(action_type="run_match"))
                 obs = result.observation
                 entry["match_score"] = obs.last_match_score
 
             # Finish to get terminal reward
             if not result.done:
                 result = env.step(Zero960Action(action_type="finish"))
 
-            reward = float(result.reward or 0.0)
             rewards.append(reward)
             entry["reward"] = reward
         except Exception as exc:
             rewards.append(0.0)
             entry["reward"] = 0.0
@@ -338,10 +768,9 @@ def run_grpo_training(
         learning_rate=5e-6,
         logging_steps=1,
         num_generations=num_generations,
-        max_completion_length=512,
         bf16=True,
-        gradient_checkpointing=True,
-        gradient_checkpointing_kwargs={"use_reentrant": False},
         report_to="none",
     )
@@ -395,6 +824,11 @@ def main() -> None:
     )
     parser.add_argument("--base-url", default="http://127.0.0.1:8000")
    parser.add_argument("--model", default="Qwen/Qwen3.5-9B")
    parser.add_argument("--steps", type=int, default=20)
    parser.add_argument("--num-generations", type=int, default=4)
    parser.add_argument("--max-turns", type=int, default=6)
@@ -411,6 +845,7 @@ def main() -> None:
        summary = run_inference_test(
            base_url=args.base_url,
            model_name=args.model,
        )
        print({
            "reward": summary.reward,
 
 from __future__ import annotations
 
 import argparse
+import ast
 import json
 import os
 import re
 
 from zero960_env.client import Zero960Client
 from zero960_env.models import Zero960Action
 
+EXAMPLE_WRITE_ACTION = json.dumps(
+    {
+        "action_type": "write_file",
+        "path": "eval.py",
+        "content": (
+            "from __future__ import annotations\n\n"
+            "import chess\n\n"
+            "PIECE_VALUES = {\n"
+            "    chess.PAWN: 100,\n"
+            "    chess.KNIGHT: 320,\n"
+            "    chess.BISHOP: 330,\n"
+            "    chess.ROOK: 500,\n"
+            "    chess.QUEEN: 900,\n"
+            "    chess.KING: 0,\n"
+            "}\n\n"
+            "def evaluate(board: chess.Board) -> int:\n"
+            "    score = 0\n"
+            "    for piece_type, piece_value in PIECE_VALUES.items():\n"
+            "        score += piece_value * len(board.pieces(piece_type, chess.WHITE))\n"
+            "        score -= piece_value * len(board.pieces(piece_type, chess.BLACK))\n"
+            "    return score\n"
+        ),
+    }
+)
+
+ACTION_SCHEMA_TEXT = (
+    "Return exactly one JSON object matching one of these shapes:\n"
+    '1. {"action_type":"write_file","path":"eval.py","content":"<full eval.py source>"}\n'
+    '2. {"action_type":"run_match"}\n'
+    '3. {"action_type":"finish"}\n'
+    '4. {"action_type":"run_static_eval"}\n'
+    '5. {"action_type":"read_file","path":"eval.py"}'
+)
+
+ACTION_CHOICE_MAP = {
+    "1": "write_file",
+    "2": "run_match",
+    "3": "finish",
+    "4": "run_static_eval",
+    "5": "read_file",
+}
+
+TRAIN_ACTION_REWARD_BIAS = {
+    "write_file": 0.35,
+    "run_match": -0.15,
+    "finish": -0.30,
+    "run_static_eval": -0.25,
+    "read_file": -0.30,
+}
+
 SYSTEM_PROMPT = (
     "You are a Chess960 evaluation engineer. You can take ONE action per turn.\n"
+    "Respond with exactly one JSON object and no extra text.\n"
+    f"{ACTION_SCHEMA_TEXT}\n"
+    "Actions:\n"
     '  {"action_type":"read_file","path":"eval.py"}\n'
+    '  {"action_type":"write_file","path":"eval.py","content":"<full replacement eval.py>"}\n'
     '  {"action_type":"run_static_eval"}\n'
     '  {"action_type":"run_match"}\n'
     '  {"action_type":"finish"}\n'
     "\n"
+    "Important rules:\n"
+    "- The full current eval.py is already included in the observation, so read_file is usually unnecessary.\n"
+    "- High-reward loop: write_file a valid full replacement, run_match, then finish.\n"
+    "- Repeating run_static_eval, finishing before a write, or finishing before an explicit match is penalized.\n"
+    "- If you write code, keep it short and valid Python that defines evaluate(board).\n"
+    "- Do not output analysis, markdown, XML tags, or prose. Do not emit <think> blocks.\n"
+    "\n"
+    "Examples:\n"
+    f"Fresh episode best first move:\n{EXAMPLE_WRITE_ACTION}\n"
+    'After a valid write, best next move:\n{"action_type":"run_match"}\n'
+    'After a match score is available, best next move:\n{"action_type":"finish"}'
 )
 
  f"Position index: {obs.start_position}\n"
107
  f"Steps remaining: {obs.remaining_steps}\n"
108
  f"Last match score: {obs.last_match_score}\n"
109
+ f"Has valid edit: {obs.has_valid_edit}\n"
110
+ f"Has explicit match: {obs.has_run_match}\n"
111
+ f"Suggested actions: {', '.join(obs.suggested_actions)}\n"
112
+ f"Workflow hint: {obs.workflow_hint}\n"
113
  f"History: {obs.history}\n\n"
114
  f"Current eval.py:\n```python\n{eval_code}\n```\n\n"
115
+ f"{ACTION_SCHEMA_TEXT}\n"
116
+ "Choose your next action. Output JSON only."
117
  )
118
  return user_msg
119
 
 
     ]
 
 
+def format_action_selection_messages(obs) -> list[dict[str, str]]:
+    """Ask the model to choose only the next action type ID."""
+    eval_code = obs.file_contents.get("eval.py", "<not read yet>")
+    return [
+        {
+            "role": "system",
+            "content": (
+                "Choose the next action for a Chess960 eval-editing task.\n"
+                "Return exactly one digit and nothing else.\n"
+                "1 = write_file\n"
+                "2 = run_match\n"
+                "3 = finish\n"
+                "4 = run_static_eval\n"
+                "5 = read_file"
+            ),
+        },
+        {
+            "role": "user",
+            "content": (
+                f"Steps remaining: {obs.remaining_steps}\n"
+                f"Last match score: {obs.last_match_score}\n"
+                f"Has valid edit: {obs.has_valid_edit}\n"
+                f"Has explicit match: {obs.has_run_match}\n"
+                f"Suggested actions: {', '.join(obs.suggested_actions)}\n"
+                f"Workflow hint: {obs.workflow_hint}\n"
+                f"History: {obs.history}\n\n"
+                f"Current eval.py:\n```python\n{eval_code}\n```\n\n"
+                "Choose the best next action ID. Return exactly one digit."
+            ),
+        },
+    ]
+
+
+def format_write_messages(obs) -> list[dict[str, str]]:
+    """Ask the model to output a full replacement eval.py file only."""
+    eval_code = obs.file_contents.get("eval.py", "<not read yet>")
+    write_prefix = build_write_prefix(eval_code)
+    return [
+        {
+            "role": "system",
+            "content": (
+                "Continue a Python file for a Chess960 engine.\n"
+                "The assistant response is appended directly after a provided prefix.\n"
+                "Output only the remaining Python lines after the prefix.\n"
+                "Do not repeat the prefix. No markdown, no prose, no JSON, no <think>."
+            ),
+        },
+        {
+            "role": "user",
+            "content": (
+                f"Steps remaining: {obs.remaining_steps}\n"
+                f"Last match score: {obs.last_match_score}\n"
+                f"Workflow hint: {obs.workflow_hint}\n"
+                "Improve the evaluation function while keeping valid Python that defines evaluate(board).\n"
+                "You are completing the file after this exact prefix:\n\n"
+                f"```python\n{write_prefix}```"
+            ),
+        },
+    ]
+
+
+def apply_action_chat_template(tokenizer, messages: list[dict[str, str]]) -> str:
+    """Apply Qwen chat template while disabling thinking when the template supports it."""
+    template_attempts = [
+        {"chat_template_kwargs": {"enable_thinking": False}},
+        {"enable_thinking": False},
+        {},
+    ]
+    for extra_kwargs in template_attempts:
+        try:
+            return tokenizer.apply_chat_template(
+                messages,
+                tokenize=False,
+                add_generation_prompt=True,
+                **extra_kwargs,
+            )
+        except TypeError:
+            continue
+    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+
+def strip_reasoning(text: str) -> str:
+    """Remove common reasoning wrappers before JSON parsing."""
+    cleaned = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL | re.IGNORECASE)
+    cleaned = re.sub(r"<\|im_start\|>assistant\s*", "", cleaned)
+    cleaned = re.sub(r"<\|im_end\|>", "", cleaned)
+    return cleaned.strip()
+
+
+def extract_python_source(text: str) -> str:
+    """Extract raw Python source from model output."""
+    cleaned = strip_reasoning(text)
+    fenced = re.findall(r"```(?:python)?\s*(.*?)\s*```", cleaned, re.DOTALL)
+    if fenced:
+        return fenced[0].strip()
+    return cleaned.strip()
+
+
+def build_write_prefix(current_code: str) -> str:
+    """Build a stable file prefix that the model must continue."""
+    marker = "def evaluate(board: chess.Board) -> int:\n"
+    match = re.search(re.escape(marker), current_code)
+    if match:
+        return current_code[:match.end()] + "    score = 0\n"
+    return (
+        "from __future__ import annotations\n\n"
+        "import chess\n\n"
+        "PIECE_VALUES = {\n"
+        "    chess.PAWN: 100,\n"
+        "    chess.KNIGHT: 320,\n"
+        "    chess.BISHOP: 330,\n"
+        "    chess.ROOK: 500,\n"
+        "    chess.QUEEN: 900,\n"
+        "    chess.KING: 0,\n"
+        "}\n\n"
+        "def evaluate(board: chess.Board) -> int:\n"
+        "    score = 0\n"
+    )
+
+
+def extract_python_continuation(text: str) -> str:
+    """Extract only indented Python lines for the evaluate() body continuation."""
+    cleaned = extract_python_source(text)
+    lines = cleaned.splitlines()
+    kept: list[str] = []
+    started = False
+
+    for line in lines:
+        if not started:
+            if not line.strip():
+                continue
+            if line.startswith(" ") or line.startswith("\t"):
+                started = True
+                kept.append(line)
+                continue
+            if re.match(r"(for|if|elif|else|while|return|score|white_|black_|center_|mobility_|piece_|pawn_|king_|board)", line.strip()):
+                started = True
+                kept.append(f"    {line.strip()}")
+                continue
+            continue
+
+        if line.strip() and not (line.startswith(" ") or line.startswith("\t")):
+            break
+        kept.append(line)
+
+    return "\n".join(kept).rstrip() + "\n" if kept else ""
+
+
+def fallback_eval_tail(current_code: str) -> str:
+    """Reuse the existing evaluate() body tail as a safe syntax fallback."""
+    marker = "    score = 0\n"
+    index = current_code.find(marker)
+    if index == -1:
+        return "    return score\n"
+    return current_code[index + len(marker):].rstrip() + "\n"
+
+
+def choose_action_id(model, tokenizer, prompt: str) -> tuple[str, dict[str, float]]:
+    """Score a fixed set of action IDs and return the most likely one."""
+    import torch
+
+    prompt_inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+    prompt_input_ids = prompt_inputs["input_ids"]
+    prompt_attention_mask = prompt_inputs["attention_mask"]
+    option_scores: dict[str, float] = {}
+
+    with torch.no_grad():
+        for option_id in ACTION_CHOICE_MAP:
+            option_tokens = tokenizer(option_id, add_special_tokens=False, return_tensors="pt")
+            option_input_ids = option_tokens["input_ids"].to(model.device)
+            option_attention_mask = option_tokens["attention_mask"].to(model.device)
+
+            full_input_ids = torch.cat([prompt_input_ids, option_input_ids], dim=1)
+            full_attention_mask = torch.cat([prompt_attention_mask, option_attention_mask], dim=1)
+            outputs = model(input_ids=full_input_ids, attention_mask=full_attention_mask)
+            log_probs = outputs.logits[:, prompt_input_ids.shape[1] - 1:-1, :].log_softmax(dim=-1)
+
+            token_log_prob = 0.0
+            for index in range(option_input_ids.shape[1]):
+                token_id = option_input_ids[0, index]
+                token_log_prob += float(log_probs[0, index, token_id])
+            option_scores[option_id] = token_log_prob / max(option_input_ids.shape[1], 1)
+
+    best_option = max(option_scores, key=option_scores.get)
+    return best_option, option_scores
+
+
+def parse_action_choice(text: str) -> str:
+    """Parse a one-token action ID from a completion."""
+    cleaned = strip_reasoning(text)
+    digit_match = re.search(r"\b([1-5])\b", cleaned)
+    if digit_match:
+        return digit_match.group(1)
+
+    lowered = cleaned.lower()
+    for action_id, action_type in ACTION_CHOICE_MAP.items():
+        if action_type in lowered:
+            return action_id
+    return "3"
+
+
+def build_training_write_code(current_code: str, variant_index: int = 0) -> str:
+    """Apply a deterministic valid edit so GRPO can learn the task loop first."""
+    candidates = [
+        (
+            "CENTER_ATTACK_BONUS = 3",
+            "CENTER_ATTACK_BONUS = 4",
+        ),
+        (
+            "BISHOP_PAIR_BONUS = 35",
+            "BISHOP_PAIR_BONUS = 45",
+        ),
+        (
+            "ROOK_OPEN_FILE_BONUS = 20",
+            "ROOK_OPEN_FILE_BONUS = 24",
+        ),
+        (
+            "PASSED_PAWN_BONUS_BY_RANK = [0, 5, 10, 18, 28, 42, 60, 0]",
+            "PASSED_PAWN_BONUS_BY_RANK = [0, 6, 12, 20, 32, 48, 68, 0]",
+        ),
+    ]
+
+    for offset in range(len(candidates)):
+        source, target = candidates[(variant_index + offset) % len(candidates)]
+        if source in current_code:
+            candidate_code = current_code.replace(source, target, 1)
+            try:
+                ast.parse(candidate_code, filename="eval.py")
+            except SyntaxError:
+                continue
+            if candidate_code != current_code:
+                return candidate_code
+    return current_code
+
+
+def build_training_action(choice_id: str, obs, variant_index: int = 0) -> Zero960Action:
+    """Convert an action-choice completion into a concrete env action."""
+    action_type = ACTION_CHOICE_MAP.get(choice_id, "finish")
+    if action_type == "write_file":
+        current_code = obs.file_contents.get("eval.py", "")
+        content = build_training_write_code(current_code, variant_index=variant_index)
+        return Zero960Action(action_type="write_file", path="eval.py", content=content)
+    if action_type == "read_file":
+        return Zero960Action(action_type="read_file", path="eval.py")
+    return Zero960Action(action_type=action_type)
+
+
+def generate_write_action(model, tokenizer, obs) -> tuple[Zero960Action, str]:
+    """Generate the full eval.py replacement after action type selection."""
+    current_code = obs.file_contents.get("eval.py", "")
+    write_prefix = build_write_prefix(current_code)
+    write_prompt = apply_action_chat_template(tokenizer, format_write_messages(obs)) + write_prefix
+    inputs = tokenizer(write_prompt, return_tensors="pt").to(model.device)
+    outputs = model.generate(
+        **inputs,
+        max_new_tokens=256,
+        do_sample=False,
+    )
+    generated = tokenizer.decode(
+        outputs[0][inputs["input_ids"].shape[1]:],
+        skip_special_tokens=True,
+    )
+    continuation = extract_python_continuation(generated)
+    if not continuation:
+        continuation = fallback_eval_tail(current_code)
+    code = write_prefix + continuation
+    try:
+        ast.parse(code, filename="eval.py")
+    except SyntaxError:
+        code = write_prefix + fallback_eval_tail(current_code)
+    return Zero960Action(action_type="write_file", path="eval.py", content=code), generated
+
+
+def choose_structured_action(
+    model,
+    tokenizer,
+    obs,
+    deterministic_write: bool = False,
+) -> tuple[Zero960Action, dict[str, float], str | None]:
+    """Choose action type via fixed-option scoring, then generate code only if needed."""
+    action_prompt = apply_action_chat_template(tokenizer, format_action_selection_messages(obs))
+    action_id, scores = choose_action_id(model, tokenizer, action_prompt)
+    adjusted_scores = dict(scores)
+
+    # Make the policy respect the environment workflow instead of repeatedly editing.
+    if obs.has_valid_edit and not obs.has_run_match:
+        adjusted_scores["2"] += 3.0
+        adjusted_scores["1"] -= 2.0
+        adjusted_scores["5"] -= 1.5
+        adjusted_scores["4"] -= 1.5
+    elif obs.has_run_match:
+        if obs.last_match_score is not None and (obs.last_match_score >= 0.25 or obs.remaining_steps <= 2):
+            adjusted_scores["3"] += 2.5
+            adjusted_scores["1"] -= 1.0
+            adjusted_scores["5"] -= 1.0
+            adjusted_scores["4"] -= 1.0
+        else:
+            adjusted_scores["1"] += 1.0
+            adjusted_scores["3"] += 0.5
+
+    action_id = max(adjusted_scores, key=adjusted_scores.get)
+    action_type = ACTION_CHOICE_MAP[action_id]
+    if action_type == "write_file":
+        if deterministic_write:
+            action = build_training_action("1", obs, variant_index=max(obs.remaining_steps, 0))
+            return action, adjusted_scores, "[deterministic write template]"
+        action, raw_code_output = generate_write_action(model, tokenizer, obs)
+        return action, adjusted_scores, raw_code_output
+    if action_type == "read_file":
+        return Zero960Action(action_type="read_file", path="eval.py"), adjusted_scores, None
+    return Zero960Action(action_type=action_type), adjusted_scores, None
+
+
+def _extract_balanced_json_objects(text: str) -> list[str]:
+    """Return brace-balanced JSON object candidates from free-form model output."""
+    candidates: list[str] = []
+    start: int | None = None
+    depth = 0
+    in_string = False
+    escape = False
+
+    for index, char in enumerate(text):
+        if start is None:
+            if char == "{":
+                start = index
+                depth = 1
+                in_string = False
+                escape = False
+            continue
+
+        if in_string:
+            if escape:
+                escape = False
+            elif char == "\\":
+                escape = True
+            elif char == '"':
+                in_string = False
+            continue
+
+        if char == '"':
+            in_string = True
+        elif char == "{":
+            depth += 1
+        elif char == "}":
+            depth -= 1
+            if depth == 0:
+                candidates.append(text[start:index + 1])
+                start = None
+
+    return candidates
+
 def parse_llm_output(text: str) -> Zero960Action:
     """Best-effort parse of LLM output into a Zero960Action."""
+    cleaned = strip_reasoning(text)
+    fenced_match = re.findall(r"```(?:json)?\s*(\{.*?\})\s*```", cleaned, re.DOTALL)
+    for candidate in fenced_match + _extract_balanced_json_objects(cleaned):
         try:
+            data = json.loads(candidate)
             return Zero960Action(**data)
         except (json.JSONDecodeError, ValueError):
+            continue
+
+    action_match = re.search(r'"action_type"\s*:\s*"([^"]+)"', cleaned)
+    if action_match:
+        action_type = action_match.group(1)
+        if action_type in {"run_static_eval", "run_match", "finish"}:
+            return Zero960Action(action_type=action_type)
+
+    lowered = cleaned.lower()
+    if "run_match" in lowered:
+        return Zero960Action(action_type="run_match")
+    if "write_file" in lowered and "eval.py" in lowered:
+        return Zero960Action(action_type="read_file", path="eval.py")
     return Zero960Action(action_type="finish")
 
 def run_handcrafted_rollout(base_url: str) -> RolloutSummary:
+    """Quick demo: apply a tiny valid edit, run a match, then finish."""
     with Zero960Client(base_url=base_url) as client:
         result = client.reset()
         obs = result.observation
 
+        current_code = obs.file_contents["eval.py"]
+        edited_code = current_code.replace("score += 15 *", "score += 20 *", 1)
+        result = client.step(
+            Zero960Action(
+                action_type="write_file",
+                path="eval.py",
+                content=edited_code,
+            )
+        )
         obs = result.observation
 
         if obs.remaining_steps > 1:
             result = client.step(Zero960Action(action_type="run_match"))
             obs = result.observation
 
 def run_inference_test(
     base_url: str,
     model_name: str = "Qwen/Qwen3.5-9B",
+    tokenizer_name: str | None = None,
     max_episode_steps: int = 6,
+    deterministic_write: bool = True,
 ) -> RolloutSummary:
     """Run a single episode with Qwen generating actions against the live env."""
     from transformers import AutoModelForCausalLM, AutoTokenizer
 
     print(f"Loading {model_name}...")
+    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name or model_name)
     model = AutoModelForCausalLM.from_pretrained(
         model_name, torch_dtype="auto", device_map="auto",
     )
 
         if result.done:
             break
 
+        print(f"\n--- Step {step_i + 1} ---")
+        action, action_scores, raw_code_output = choose_structured_action(
+            model,
+            tokenizer,
+            obs,
+            deterministic_write=deterministic_write,
         )
+        score_text = ", ".join(
+            f"{choice}:{score:.3f}" for choice, score in sorted(action_scores.items())
         )
+        print(f"Action scores: {score_text}")
         print(f"Parsed action: {action.action_type}", end="")
         if action.path:
             print(f" path={action.path}", end="")
         print()
+        if raw_code_output is not None:
+            print(f"Write output: {raw_code_output[:300]}...")
 
         result = client.step(action)
         obs = result.observation
         print(f"Status: {obs.status_message}")
+        print(f"Step reward: {result.reward}")
 
         if not result.done:
             result = client.step(Zero960Action(action_type="finish"))
642
  prompts = []
643
  for _ in range(n):
644
  result = env.reset()
645
+ msgs = format_action_selection_messages(result.observation)
646
+ prompt_text = apply_action_chat_template(tokenizer, msgs)
 
 
647
  prompts.append(prompt_text)
648
  return Dataset.from_dict({"prompt": prompts})
649
 
 
              "gen": i,
              "completion_preview": completion[:500],
              "completion_len": len(completion),
+             "step_rewards": [],
          }
          try:
              result = env.reset()
              obs = result.observation

+             choice_id = parse_action_choice(completion)
+             action = build_training_action(choice_id, obs, variant_index=step_n + i)
+             entry["choice_id"] = choice_id
              entry["parsed_action"] = action.action_type
              entry["parsed_path"] = action.path
              if action.action_type == "write_file" and action.content:
                  entry["code_preview"] = action.content[:300]
                  entry["code_len"] = len(action.content)
+                 entry["code_changed"] = action.content != obs.file_contents.get("eval.py", "")

              result = env.step(action)
              obs = result.observation
              entry["env_status_1"] = obs.status_message
+             entry["step_rewards"].append(float(result.reward or 0.0))

              # If the model wrote code, run a match to get a real score
              if not result.done and action.action_type == "write_file":
                  result = env.step(Zero960Action(action_type="run_match"))
                  obs = result.observation
                  entry["match_score"] = obs.last_match_score
+                 entry["step_rewards"].append(float(result.reward or 0.0))

              # Finish to get terminal reward
              if not result.done:
                  result = env.step(Zero960Action(action_type="finish"))
+                 entry["step_rewards"].append(float(result.reward or 0.0))

+             reward = float(result.reward or 0.0) + TRAIN_ACTION_REWARD_BIAS[action.action_type]
              rewards.append(reward)
              entry["reward"] = reward
+             entry["reward_bias"] = TRAIN_ACTION_REWARD_BIAS[action.action_type]
          except Exception as exc:
              rewards.append(0.0)
              entry["reward"] = 0.0
 
          learning_rate=5e-6,
          logging_steps=1,
          num_generations=num_generations,
+         max_completion_length=4,
          bf16=True,
+         gradient_checkpointing=False,
          report_to="none",
      )
 
 
      )
      parser.add_argument("--base-url", default="http://127.0.0.1:8000")
      parser.add_argument("--model", default="Qwen/Qwen3.5-9B")
+     parser.add_argument(
+         "--tokenizer",
+         default=None,
+         help="Optional tokenizer path/name for infer mode when loading a checkpoint without tokenizer files.",
+     )
      parser.add_argument("--steps", type=int, default=20)
      parser.add_argument("--num-generations", type=int, default=4)
      parser.add_argument("--max-turns", type=int, default=6)
 
      summary = run_inference_test(
          base_url=args.base_url,
          model_name=args.model,
+         tokenizer_name=args.tokenizer,
      )
      print({
          "reward": summary.reward,
train/sft_student.py ADDED
@@ -0,0 +1,243 @@
+ """Supervised fine-tuning for a bounded-action 0x960 student policy."""
+
+ from __future__ import annotations
+
+ import argparse
+ import glob
+ import json
+ import random
+ from collections import Counter
+ from pathlib import Path
+
+ from datasets import Dataset
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from trl import SFTConfig, SFTTrainer
+
+
+ ALLOWED_ACTION_KEYS = {"action_type", "path", "content"}
+
+
+ def _resolve_input_paths(explicit_paths: list[str], data_glob: str) -> list[Path]:
+     paths = [Path(path) for path in explicit_paths]
+     paths.extend(Path(path) for path in glob.glob(data_glob))
+     unique_paths = sorted({path.resolve() for path in paths if path.exists()})
+     if not unique_paths:
+         raise FileNotFoundError(
+             "no SFT data files found; pass --data-path or adjust --data-glob"
+         )
+     return unique_paths
+
+
+ def _validate_record(payload: dict, source_path: Path, line_number: int) -> dict | None:
+     messages = payload.get("messages")
+     metadata = payload.get("metadata", {})
+     if not isinstance(messages, list) or len(messages) != 3:
+         return None
+
+     roles = [message.get("role") for message in messages if isinstance(message, dict)]
+     if roles != ["system", "user", "assistant"]:
+         return None
+
+     assistant_content = messages[-1].get("content")
+     if not isinstance(assistant_content, str):
+         return None
+
+     try:
+         action_payload = json.loads(assistant_content)
+     except json.JSONDecodeError:
+         return None
+
+     if not isinstance(action_payload, dict):
+         return None
+     if set(action_payload) != ALLOWED_ACTION_KEYS:
+         return None
+     if action_payload["action_type"] not in {
+         "read_file",
+         "write_file",
+         "run_static_eval",
+         "run_match",
+         "finish",
+     }:
+         return None
+
+     final_reward = metadata.get("final_reward")
+     if final_reward is not None:
+         final_reward = float(final_reward)
+
+     return {
+         "messages": messages,
+         "metadata": {
+             "source_path": str(source_path),
+             "line_number": line_number,
+             "episode_index": metadata.get("episode_index"),
+             "turn_index": metadata.get("turn_index"),
+             "teacher_model": metadata.get("teacher_model"),
+             "final_reward": final_reward,
+         },
+         "action_type": action_payload["action_type"],
+     }
+
+
+ def load_sft_records(
+     input_paths: list[Path],
+     min_final_reward: float,
+     max_examples: int | None,
+     seed: int,
+ ) -> tuple[list[dict], dict]:
+     records: list[dict] = []
+     skipped_invalid = 0
+     skipped_low_reward = 0
+     dedupe_keys: set[str] = set()
+
+     for input_path in input_paths:
+         for line_number, line in enumerate(input_path.read_text().splitlines(), start=1):
+             if not line.strip():
+                 continue
+             payload = json.loads(line)
+             record = _validate_record(payload, input_path, line_number)
+             if record is None:
+                 skipped_invalid += 1
+                 continue
+             final_reward = record["metadata"]["final_reward"]
+             if final_reward is not None and final_reward < min_final_reward:
+                 skipped_low_reward += 1
+                 continue
+
+             dedupe_key = json.dumps(record["messages"], sort_keys=True)
+             if dedupe_key in dedupe_keys:
+                 continue
+             dedupe_keys.add(dedupe_key)
+             records.append(record)
+
+     random.Random(seed).shuffle(records)
+     if max_examples is not None:
+         records = records[:max_examples]
+
+     stats = {
+         "input_files": [str(path) for path in input_paths],
+         "records_kept": len(records),
+         "skipped_invalid": skipped_invalid,
+         "skipped_low_reward": skipped_low_reward,
+         "action_counts": dict(Counter(record["action_type"] for record in records)),
+     }
+     return records, stats
+
+
+ def split_records(records: list[dict], eval_fraction: float) -> tuple[list[dict], list[dict]]:
+     if not records:
+         return [], []
+     if eval_fraction <= 0 or len(records) < 10:
+         return records, []
+     eval_size = max(1, int(len(records) * eval_fraction))
+     if eval_size >= len(records):
+         eval_size = len(records) - 1
+     return records[eval_size:], records[:eval_size]
+
+
+ def build_dataset(records: list[dict]) -> Dataset:
+     return Dataset.from_list(
+         [
+             {
+                 "messages": record["messages"],
+                 "metadata": record["metadata"],
+             }
+             for record in records
+         ]
+     )
+
+
+ def parse_args() -> argparse.Namespace:
+     parser = argparse.ArgumentParser(description="SFT a bounded-action 0x960 student model.")
+     parser.add_argument("--model", default="Qwen/Qwen3.5-0.8B")
+     parser.add_argument("--data-path", action="append", default=[])
+     parser.add_argument("--data-glob", default="outputs/codex_distill/sft_samples_*.jsonl")
+     parser.add_argument("--output-dir", default="outputs/sft_student")
+     parser.add_argument("--min-final-reward", type=float, default=0.4)
+     parser.add_argument("--max-examples", type=int, default=None)
+     parser.add_argument("--eval-fraction", type=float, default=0.1)
+     parser.add_argument("--seed", type=int, default=42)
+     parser.add_argument("--per-device-train-batch-size", type=int, default=1)
+     parser.add_argument("--per-device-eval-batch-size", type=int, default=1)
+     parser.add_argument("--gradient-accumulation-steps", type=int, default=8)
+     parser.add_argument("--learning-rate", type=float, default=2e-5)
+     parser.add_argument("--num-train-epochs", type=float, default=3.0)
+     parser.add_argument("--max-steps", type=int, default=-1)
+     parser.add_argument("--logging-steps", type=int, default=5)
+     parser.add_argument("--save-total-limit", type=int, default=2)
+     parser.add_argument("--max-length", type=int, default=1024)
+     parser.add_argument("--assistant-only-loss", action="store_true")
+     parser.add_argument("--dry-run", action="store_true")
+     return parser.parse_args()
+
+
+ def main() -> None:
+     args = parse_args()
+     input_paths = _resolve_input_paths(args.data_path, args.data_glob)
+     records, stats = load_sft_records(
+         input_paths=input_paths,
+         min_final_reward=args.min_final_reward,
+         max_examples=args.max_examples,
+         seed=args.seed,
+     )
+     if not records:
+         raise RuntimeError("no usable SFT rows found after validation and filtering")
+
+     train_records, eval_records = split_records(records, args.eval_fraction)
+     stats["train_records"] = len(train_records)
+     stats["eval_records"] = len(eval_records)
+     print(stats)
+
+     if args.dry_run:
+         return
+
+     output_dir = Path(args.output_dir)
+     output_dir.mkdir(parents=True, exist_ok=True)
+
+     tokenizer = AutoTokenizer.from_pretrained(args.model)
+     if tokenizer.pad_token is None:
+         tokenizer.pad_token = tokenizer.eos_token
+
+     use_cuda = torch.cuda.is_available()
+     use_bf16 = use_cuda and torch.cuda.is_bf16_supported()
+     model_kwargs = {"torch_dtype": torch.bfloat16} if use_bf16 else {}
+     if tokenizer.padding_side != "right":
+         tokenizer.padding_side = "right"
+
+     train_dataset = build_dataset(train_records)
+     eval_dataset = build_dataset(eval_records) if eval_records else None
+
+     trainer = SFTTrainer(
+         model=AutoModelForCausalLM.from_pretrained(args.model, **model_kwargs),
+         args=SFTConfig(
+             output_dir=str(output_dir),
+             per_device_train_batch_size=args.per_device_train_batch_size,
+             per_device_eval_batch_size=args.per_device_eval_batch_size,
+             gradient_accumulation_steps=args.gradient_accumulation_steps,
+             learning_rate=args.learning_rate,
+             num_train_epochs=args.num_train_epochs,
+             max_steps=args.max_steps,
+             logging_steps=args.logging_steps,
+             save_strategy="epoch",
+             eval_strategy="epoch" if eval_dataset is not None else "no",
+             save_total_limit=args.save_total_limit,
+             report_to="none",
+             bf16=use_bf16,
+             gradient_checkpointing=use_cuda,
+             assistant_only_loss=args.assistant_only_loss,
+             max_length=args.max_length,
+             remove_unused_columns=False,
+             dataset_num_proc=1,
+             seed=args.seed,
+         ),
+         train_dataset=train_dataset,
+         eval_dataset=eval_dataset,
+         processing_class=tokenizer,
+     )
+     trainer.train()
+     trainer.save_model(str(output_dir / "final"))
+     tokenizer.save_pretrained(str(output_dir / "final"))
+
+
+ if __name__ == "__main__":
+     main()
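For reference, a minimal sketch of one JSONL row that the validator in `train/sft_student.py` would accept: exactly three system/user/assistant messages, with the assistant content a JSON object carrying exactly the `action_type`/`path`/`content` keys and a whitelisted action type. The message texts and metadata values below are made up for illustration.

```python
import json

# Hypothetical SFT row matching the validator's structural rules; the
# prompt text, path, and reward are illustrative, not from real data.
record = {
    "messages": [
        {"role": "system", "content": "You improve a Chess960 engine by editing eval.py."},
        {"role": "user", "content": "Observation: eval.py currently scores material only."},
        {
            "role": "assistant",
            "content": json.dumps(
                {"action_type": "read_file", "path": "eval.py", "content": None}
            ),
        },
    ],
    "metadata": {"episode_index": 0, "turn_index": 0, "final_reward": 0.6},
}

# Replicate the validator's checks: role order, exact action keys, action whitelist.
action = json.loads(record["messages"][-1]["content"])
assert [m["role"] for m in record["messages"]] == ["system", "user", "assistant"]
assert set(action) == {"action_type", "path", "content"}
assert action["action_type"] in {"read_file", "write_file", "run_static_eval", "run_match", "finish"}

line = json.dumps(record)  # one JSONL line as consumed by load_sft_records
```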
uv.lock ADDED
The diff for this file is too large to render. See raw diff