File size: 11,846 Bytes
4ed9a84
67b703b
4ed9a84
67b703b
4ed9a84
67b703b
4ed9a84
67b703b
4ed9a84
 
 
67b703b
4ed9a84
67b703b
4ed9a84
 
 
eac9d9f
4ed9a84
eac9d9f
4ed9a84
 
 
eac9d9f
4ed9a84
eac9d9f
4ed9a84
 
 
eac9d9f
4ed9a84
eac9d9f
4ed9a84
 
eac9d9f
4ed9a84
eac9d9f
4ed9a84
 
 
 
 
 
67b703b
4ed9a84
67b703b
4ed9a84
 
 
 
 
67b703b
4ed9a84
67b703b
4ed9a84
 
 
67b703b
4ed9a84
67b703b
4ed9a84
 
 
 
 
67b703b
4ed9a84
67b703b
4ed9a84
 
 
 
 
 
67b703b
4ed9a84
67b703b
4ed9a84
 
 
 
67b703b
4ed9a84
ea3bbb3
4ed9a84
 
 
 
ea3bbb3
4ed9a84
8da9024
4ed9a84
 
 
 
8da9024
4ed9a84
93f58fd
4ed9a84
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
# Process Log

Chronological build log for 0x960. Every entry is a meaningful work block with decisions, results, and next steps.

---

## 2026-03-07 20:10 EST — Project Kickoff

- Scoped the hackathon MVP: a model that improves a Chess960 engine through bounded code edits in an OpenEnv environment.
- Cut everything that didn't serve the core claim: no OAuth, no broad repo editing, no frontier-model dependence at runtime.
- Locked the stack: OpenEnv 0.2.1, HF Spaces, TRL/Unsloth-compatible training.

## 2026-03-07 20:20 EST — Docs & Architecture

- Rewrote all project docs to match the narrowed scope.
- Set the editable surface to `eval.py` with structured bounded actions.
- Added `docs/architecture.md`, `docs/why_chess960.md`, `docs/demo-script.md`.

## 2026-03-07 20:35 EST — Scaffolding & Engine Core

- Built the full source layout under `src/`.
- Implemented Chess960 engine core: board representation, default evaluation, search routine, workspace template for bounded edits.
- Episode runtime: resets fresh workspace, accepts structured actions, computes reward from match outcomes.

## 2026-03-07 20:42 EST — OpenEnv Integration

- Wired up OpenEnv-facing models, FastAPI server, and minimal rollout script.
- Kept this layer intentionally thin — core environment logic stays in the local runtime package.
- Verified syntax compilation across `src/` and `train/`.

## 2026-03-07 20:52 EST — Git & GitHub

- Initialized git, created batched local history.
- Created and pushed GitHub repo at `qtzx06/0x960`.

## 2026-03-07 21:15 EST — OpenEnv 0.2.1 API Integration

- Fixed all API mismatches against actual `openenv-core` 0.2.1 signatures.
- Models now extend proper `Action` and `Observation` base classes.
- Replaced `HTTPEnvClient` with WebSocket-based `EnvClient`.
- Added `openenv.yaml` manifest and `Dockerfile` for HF Spaces.
- Rewrote training script with `--mode handcrafted` and `--mode train`.
- **First end-to-end success: server starts, `/health` OK, handcrafted rollout completes with reward=0.125.**

## 2026-03-07 21:30 EST — H100 Deployment & First Model Run

- Made repo public.
- Set up Northflank H100 (80GB HBM3) with PyTorch 2.8.0 + CUDA 12.6.
- Ran Qwen 3.5-9B inference against live environment on H100.
- Model downloaded 19.3GB weights at ~3.3GB/s, loaded in ~7s.
- **Result: model ran 2-step episode (eval → finish), reward=0.25. Reasons about chess but never writes code.**

## 2026-03-07 21:45 EST — TRL GRPO Architecture

- Evaluated agent harness options: ACP, smolagents, SkyRL+OpenEnv, TRL+OpenEnv.
- Chose TRL's native OpenEnv integration (`rollout_func` + `generate_rollout_completions`).
- Rewrote training script to use TRL's multi-turn episode pattern.

## 2026-03-07 22:00 EST — QLoRA on H100

- Full-parameter GRPO OOMed on 80GB H100 (9B model).
- Switched to QLoRA: 4-bit NF4 + LoRA r=16 on attention + MLP projections.
- Discovered Qwen3.5 hybrid architecture: `self_attn` + `linear_attn` + `partial_rotary_factor=0.25`.
- Fixed shape mismatch in gradient checkpointing, dropped vLLM (version conflict).
- **Training running on H100 with QLoRA + gradient checkpointing.**

## 2026-03-08 00:30 EST — The Critical Diagnosis

- Analyzed rollout logs and identified the real failure mode: **behavioral, not architectural.**
- Agent collapsed into `run_static_eval` spam or premature `finish` — never attempted code edits.
- This was the key insight: raw RL can't optimize a policy that doesn't explore the right actions.
- Redesigned reward shaping: valid writes get bonuses, repeated evals get penalties, finishing without editing is heavily penalized.
- Added workflow hints and suggested next actions to observations.
- Replaced brittle regex JSON parser with brace-balanced extractor.

## 2026-03-08 02:10 EST — Teacher Distillation Pipeline

- Built `train/codex_distill.py` — teacher trajectory collector using GPT-5.4.
- Teacher constrained to same bounded JSON action schema as student.
- Collected successful trajectories and exported SFT-ready chat samples.
- Added generated training artifacts to `.gitignore`.

## 2026-03-08 04:20 EST — Student SFT Pipeline

- Built `train/sft_student.py` — minimal student fine-tuning entrypoint.
- Conversational dataset with `assistant_only_loss` — student learns the teacher's bounded actions, not the prompt text.
- Validates assistant action payloads, drops malformed rows, deduplicates.
- Dry-run mode for inspecting corpus before real training.

## 2026-03-08 05:05 EST — Engine Evaluation Upgrade

- Replaced toy default eval with production-quality Chess960-safe heuristic.
- Added: pawn structure, mobility, center control, rook-file activity, king safety, castling rights, bishop pair, development terms.
- Added move ordering to search: captures, checks, promotions, castling prioritized.
- Built `train/benchmark_eval.py` — held-out Chess960 benchmark with Elo delta estimation.

## 2026-03-08 05:40 EST — Student SFT Training (H100)

- Ran first remote SFT job on Northflank H100.
- Corpus: 105 clean rows / 35 successful episodes. Training time: ~5 minutes.
- **Final SFT metrics: train loss 0.2072, eval loss 0.04192, token accuracy 98.76%.**
- **Before vs after:**
  - Base Qwen 3.5-0.8B: spammed `run_static_eval` for all 6 steps, reward **-2.1**
  - SFT student: executed `write_file → run_match → finish` in 3 steps, reward **+1.0**
- The teacher-student path solved action discovery completely.

## 2026-03-08 06:05 EST — Codex Swarm Plan

- Designed champion/challenger engine-iteration loop using multiple Codex workers.
- OpenEnv remains the core submission artifact; Codex swarm is the outer optimization layer.
- Local workers as default orchestration path for fast debugging.

## 2026-03-08 06:40 EST — Codex Swarm Implementation

- Built `train/codex_swarm.py` — runnable local coordinator.
- Champion eval snapshot, worker sandboxes under `/tmp/0x960-codex-swarm/`.
- Git worktree first, lightweight clone fallback.
- Promotion ledger, per-worker JSON results, round logging.
- Refactored `train/benchmark_eval.py` into reusable library surface.

## 2026-03-08 06:58 EST — UCI Anchor Benchmark

- Built `train/benchmark_uci.py` — benchmark against external engines like Stockfish.
- Calibrated anchors with `UCI_LimitStrength=true` and `UCI_Elo` settings.
- **First external ladder: worker-1 patch scored 4.5/8 vs Stockfish 1320, 2.0/8 vs Stockfish 1600.**

## 2026-03-08 07:20 EST — Swarm Tightening

- Narrowed from undirected agents to explicit heuristic lanes: king safety, tactical pressure, piece activity, pawn/rook structure.
- Added per-worker timeout enforcement.
- Workers make one bounded patch and stop; coordinator handles all benchmarking.
- Added `--max-diff-lines` enforcement (default 80) to reject noisy whole-file rewrites.

## 2026-03-08 07:34 EST — League Self-Play Benchmark

- Built `train/benchmark_league.py` — league-style self-play against original baseline plus all accepted champion history.
- Skips mirror matches, skips byte-identical snapshots.
- First league run showed champion splitting: strong vs original baseline, competitive vs accepted history.

## 2026-03-08 07:45 EST — Results Dashboard

- Built `train/build_dashboard.py` — static HTML dashboard generator.
- Visualizes: champion progression, Elo deltas, swarm results, league self-play, Stockfish anchor bars.
- Self-contained HTML + JSON, no framework needed.

## 2026-03-08 08:00 EST — Swarm Hook Architecture

- Refactored champion eval into explicit hook lanes: `_structure_hook`, `_tactical_hook`, `_activity_hook`, `_pawn_endgame_hook`, `_initiative_hook`.
- Each worker role bound to one named hook.
- Fixed diff gate to measure against frozen champion snapshot (not repo HEAD).
- Added staged benchmark funnel: cheap 8-position screen → final 16-position benchmark only for screen winner.

## 2026-03-08 08:28 EST — First Swarm Promotion

- **worker-2 patched `_tactical_hook`, screened at 0.656 over 16 games, held 0.578 over 32 games: +54.7 Elo.**
- Promoted to champion. Saved accepted snapshot.
- Added round scheduler that prioritizes underdeveloped hook lanes automatically.

## 2026-03-08 08:54 EST — Classical Search Overhaul

- Eval stacking hit diminishing returns. Pivoted to classical search quality.
- Added quiescence search (captures + promotions at depth-0 leaves).
- Added transposition table with probe/store.
- Added killer-move and history-heuristic move ordering.
- **Internal engine-vs-engine: 15.5/16 games, score=0.969, +596.5 Elo vs search baseline.**
- **External anchor: 12.5/16 vs Stockfish 1320, score=0.781, +221.1 Elo.**

## 2026-03-08 10:08 EST — Search Surface Swarm

- Extended Codex swarm with `--surface search` mode.
- Search workers edit only `search.py`, targeting specific functions.
- Added `champion_search.py`, per-round baseline engine snapshots.
- Full engine-vs-engine benchmark for search-surface promotion.

## 2026-03-08 11:14 EST — Selective Root Deepening

- When root is in check, move count <= 12, or low-material endgame: search one extra ply.
- Opening roots stayed fast (~0.07s to 0.11s on sampled starts).
- Synced best eval into workspace template.

## 2026-03-08 11:28 EST — Quiescence Fix + Pawn Structure

- Fixed search bug: quiescence no longer uses stand-pat when in check; now searches all legal evasions.
- Filled `_structure_hook` with pawn coordination terms: connected pawns, chains, central duos, phase-weighted.

## 2026-03-08 11:36 EST — Persistent Transposition Table

- Module-level TT persists across `select_move` calls within the same game.
- TT best move used at root for move ordering.
- **Mid-game plies improved from ~0.62-1.03s to ~0.32-0.63s (roughly 40% faster).**

## 2026-03-08 11:44 EST — Principal Variation Search (PVS)

- First move at each node: full window. Later moves: zero-window first, full re-search only on fail-high.
- **Later plies improved from ~0.32-0.63s to ~0.25-0.46s (another 30% speed gain).**

## 2026-03-08 11:51 EST — Opening Depth Extension

- Trade PVS speed wins for deeper opening search.
- If `fullmove_number <= 2` and `<= 24` legal moves, search one extra ply at root.
- Move choices changed from PVS-only run — intended effect.

## 2026-03-08 12:00 EST — Null-Move Pruning + Persistent History

- Null-move pruning for non-check, non-endgame nodes at depth >= 3.
- Persistent history ordering across moves within same game.
- Operating envelope stable: opening plies ~0.86-1.21s, later plies ~0.10-0.72s.

## 2026-03-08 12:08 EST — LMR + Aspiration Windows

- Late Move Reductions: later quiet moves at depth >= 3 searched at reduced ply first.
- Aspiration windows at root, seeded from persistent TT score.
- Tried and reverted quiescence delta pruning (made early plies worse).
- **Final search stack: alpha-beta negamax, quiescence with in-check evasions, TT probe/store, persistent TT reuse, TT root move ordering, persistent history, killer ordering, PVS, null-move pruning, LMR, aspiration windows, selective root extensions.**

## 2026-03-08 12:20 EST — Swarm Promotions Continue

- 4 total eval champions promoted through the tournament gate.
- Search-surface acceptance on latest round.
- Current engine: **competitive with Stockfish 1600 in local Chess960 setup.**

## 2026-03-08 12:45 EST — Submission Artifacts

- Generated submission media: score progression chart, Stockfish anchor bars.
- Built final dashboard with all results.
- Finalized swarm tooling and submission artifacts.

## 2026-03-08 13:00 EST — Final Polish

- Rewrote README: results-first layout with Elo gains, architecture diagram, concrete metrics.
- Updated training docs with GRPO deep-dive and full hackathon timeline.
- Sharpened demo script with before/after numbers.
- All artifacts committed and pushed.