CreativeEngineer committed on
Commit f09f601 · 1 Parent(s): e815b38

docs: add verifier reward review

docs/VERIFIER_REWARD_REVIEW_2026-03-08.md ADDED
# Verifier And Reward Review

Date: 2026-03-08

## Scope

This note reviews how well the current verifier path and reward function serve the repo's stated goal:

- ship one clear, reproducible OpenEnv environment for budget-constrained stellarator design
- keep the verifier official
- keep the task human-playable
- keep reward explainable and iteratively improved

Primary review targets:

- [docs/FUSION_DESIGN_LAB_PLAN_V2.md](/Users/suhjungdae/code/fusion-design-lab/docs/FUSION_DESIGN_LAB_PLAN_V2.md)
- [docs/P1_ENV_CONTRACT_V1.md](/Users/suhjungdae/code/fusion-design-lab/docs/P1_ENV_CONTRACT_V1.md)
- [server/physics.py](/Users/suhjungdae/code/fusion-design-lab/server/physics.py)
- [server/environment.py](/Users/suhjungdae/code/fusion-design-lab/server/environment.py)
- recent history from `d58c100` through `6deaccc`

## Executive Summary

The verifier implementation is directionally strong and mostly correct now. The major March 7, 2026 commit sequence fixed the important correctness problems:

- replaced the synthetic evaluator with real `constellaration` / `GeometricalProblem`
- repaired the blocked 3-knob family by adding explicit triangularity control
- separated low-fidelity `run` truth from high-fidelity `submit` truth
- fixed the post-VMEC-failure recovery reward bug

The reward is also directionally good. `Reward V0` remains simple, mostly verifier-driven, and aligned with the repo docs.

The main gap is no longer basic verifier wiring. The main gap is that the repo still lacks the validation evidence that its own docs require before calling the environment "good":

- no tracked fixtures
- no manual playtest log
- no documented reward pathology/fix or explicit "Reward V0 survived playtest" note

## Validated Findings

### 1. Missing validation artifacts are still the biggest repo-level gap

The planning docs explicitly say the environment artifact is the product, not just the code. Current repo state still lacks:

- tracked `P1` fixtures
- manual playtest evidence
- a documented `Reward V0 -> V1` change, or an explicit record that `Reward V0` survived playtesting unchanged

Relevant references:

- [docs/FUSION_DESIGN_LAB_PLAN_V2.md](/Users/suhjungdae/code/fusion-design-lab/docs/FUSION_DESIGN_LAB_PLAN_V2.md#L36)
- [docs/FUSION_DESIGN_LAB_PLAN_V2.md](/Users/suhjungdae/code/fusion-design-lab/docs/FUSION_DESIGN_LAB_PLAN_V2.md#L420)
- [docs/FUSION_DELIVERABLES_MAP.md](/Users/suhjungdae/code/fusion-design-lab/docs/FUSION_DELIVERABLES_MAP.md#L21)
- [server/data/p1/README.md](/Users/suhjungdae/code/fusion-design-lab/server/data/p1/README.md#L15)

### 2. Observation legibility still conflicts with the official tolerance semantics

The env prints hard constraint thresholds in diagnostics, but actual feasibility is decided by `GeometricalProblem.is_feasible()` using the official 1% tolerance.

That means a human can see:

- a printed bound like `edge_iota_over_nfp >= 0.3`
- a reported value slightly below `0.3`
- `constraints=SATISFIED`

This is a direct usability problem for a repo whose goal is a legible, human-playable environment.

Relevant references:

- [server/environment.py](/Users/suhjungdae/code/fusion-design-lab/server/environment.py#L291)
- [server/physics.py](/Users/suhjungdae/code/fusion-design-lab/server/physics.py#L114)
- [docs/P1_PARAMETERIZATION_DEEPDIVE.md](/Users/suhjungdae/code/fusion-design-lab/docs/P1_PARAMETERIZATION_DEEPDIVE.md#L202)

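One way to resolve this is to print the tolerance-relaxed bound next to the nominal one, so the three observations above stop looking contradictory. A minimal sketch, assuming a 1% relative tolerance and hypothetical `ConstraintSpec` / `format_constraint` names (not the repo's actual diagnostics API):

```python
from dataclasses import dataclass

# Hypothetical mirror of the official 1% relative tolerance used by
# GeometricalProblem.is_feasible(); the real value lives upstream.
FEASIBILITY_RTOL = 0.01

@dataclass
class ConstraintSpec:
    name: str     # metric name as printed in diagnostics
    bound: float  # nominal lower bound shown to the human
    value: float  # evaluated metric value

def effective_bound(bound: float, rtol: float = FEASIBILITY_RTOL) -> float:
    """Lower bound after applying the relative tolerance."""
    return bound * (1.0 - rtol)

def format_constraint(c: ConstraintSpec) -> str:
    """Render both the nominal and the tolerance-adjusted bound, so a
    value slightly below the printed bound no longer looks contradictory."""
    ok = c.value >= effective_bound(c.bound)
    status = "SATISFIED" if ok else "VIOLATED"
    return (f"{c.name} >= {c.bound:.3f} "
            f"(official tolerance admits >= {effective_bound(c.bound):.4f}): "
            f"value={c.value:.4f} -> {status}")
```

With this rendering, the `edge_iota_over_nfp` case above prints both `>= 0.300` and the relaxed `>= 0.2970`, so `constraints=SATISFIED` at a value of `0.298` reads as intended rather than as a bug.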
### 3. High-fidelity best-state bookkeeping is still not trustworthy

`best_high_fidelity_score` and `best_high_fidelity_feasibility` are not true best high-fidelity values. On submit, the env can overwrite them by reevaluating `best_params`, even if the current submit is the best actual high-fidelity result.

This weakens the clean low-fi/high-fi split the contract calls for.

Relevant references:

- [server/environment.py](/Users/suhjungdae/code/fusion-design-lab/server/environment.py#L368)
- [server/environment.py](/Users/suhjungdae/code/fusion-design-lab/server/environment.py#L449)
- [docs/P1_ENV_CONTRACT_V1.md](/Users/suhjungdae/code/fusion-design-lab/docs/P1_ENV_CONTRACT_V1.md#L163)

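A minimal fix is a dedicated record that only actual high-fidelity evaluations can update. The sketch below is hypothetical: the `HighFidelityBest` name, the `observe` hook, and the maximize convention are assumptions, not the env's current API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HighFidelityBest:
    """Tracks the best *actual* high-fidelity result seen so far.

    Updated only from completed high-fidelity evaluations, never by
    reevaluating low-fidelity `best_params`, so the recorded score is
    monotonically non-worsening across the episode.
    """
    score: Optional[float] = None
    feasibility: Optional[float] = None
    params: Optional[dict] = None

    def observe(self, score: float, feasibility: float, params: dict) -> bool:
        """Record a finished high-fidelity evaluation.

        Returns True when this evaluation becomes the new best. Assumes
        higher score is better; flip the comparison if the repo minimizes.
        """
        if self.score is None or score > self.score:
            self.score = score
            self.feasibility = feasibility
            self.params = dict(params)
            return True
        return False
```

On submit, the env would call `observe(...)` with the submit's own high-fidelity result, and `best_high_fidelity_score` would report this record instead of a fresh reevaluation of `best_params`.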
### 4. Submit is operationally expensive

One `submit` can trigger up to three high-fidelity evaluations:

1. current design
2. initial high-fidelity reference
3. reevaluated `best_params`

That makes baseline comparison and manual validation much slower than the task surface implies.

Relevant references:

- [server/environment.py](/Users/suhjungdae/code/fusion-design-lab/server/environment.py#L151)
- [server/environment.py](/Users/suhjungdae/code/fusion-design-lab/server/environment.py#L442)

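One low-risk way to cut this cost is to memoize high-fidelity results per parameter set, so the fixed initial reference and any repeated `best_params` reevaluation become cache hits rather than fresh VMEC runs. A sketch under assumptions: `HighFiCache` and the injected `evaluate_fn` are hypothetical, not the repo's actual evaluator interface.

```python
class HighFiCache:
    """Memoize high-fidelity evaluations by parameter key.

    Within one episode, evaluating the same parameter dict twice costs
    one real evaluation; `misses` counts the real (expensive) runs.
    """
    def __init__(self, evaluate_fn):
        self._evaluate = evaluate_fn
        self._memo = {}
        self.misses = 0

    def evaluate(self, params: dict):
        # Key on sorted items so key order in the dict does not matter.
        key = tuple(sorted(params.items()))
        if key not in self._memo:
            self.misses += 1
            self._memo[key] = self._evaluate(params)
        return self._memo[key]
```

With this in place, a worst-case submit drops from three high-fidelity runs to at most two, and to one on any later episode step that revisits a cached design.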
### 5. Baseline comparison does not reject failed high-fidelity submits

[baselines/compare.py](/Users/suhjungdae/code/fusion-design-lab/baselines/compare.py#L54) only checks that the final step used high-fidelity evaluation. It does not check whether that high-fidelity submit succeeded.

That means a failed submit can still count as a terminal high-fi result in comparison output.

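The missing check is small. A hedged sketch of what `compare.py` could test, assuming a hypothetical step-record shape with `fidelity`, `evaluation_failed`, and `metrics` keys (the real schema may differ):

```python
def terminal_high_fi_valid(final_step: dict) -> bool:
    """Accept a terminal step for comparison only when it was both a
    high-fidelity evaluation *and* a successful one."""
    if final_step.get("fidelity") != "high":
        return False
    if final_step.get("evaluation_failed", False):
        return False
    # A failed VMEC run may also surface as missing/None metrics.
    return final_step.get("metrics") is not None
```

Comparison output could then either drop such runs or flag them explicitly, instead of silently counting a failed submit as a terminal high-fi result.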
### 6. Verifier failure normalization is incomplete

[server/physics.py](/Users/suhjungdae/code/fusion-design-lab/server/physics.py#L87) converts `RuntimeError` into explicit failure metrics, but other exception types can still crash the env instead of producing documented failure observations.

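Broadening the handler to `Exception` closes this gap while still letting `KeyboardInterrupt`/`SystemExit` propagate (they inherit from `BaseException`, not `Exception`). A sketch, with hypothetical `failure_metrics` contents standing in for the env's documented failure schema:

```python
import logging

logger = logging.getLogger(__name__)

def failure_metrics(reason: str) -> dict:
    # Placeholder shape; the env's documented failure metrics go here.
    return {"evaluation_failed": True, "failure_reason": reason}

def evaluate_safely(evaluate_fn, params: dict) -> dict:
    """Normalize *any* evaluator exception into failure metrics.

    Widens the current RuntimeError-only handling so solver crashes
    (ValueError, FloatingPointError, library-specific errors, ...)
    produce the documented failure observation instead of crashing
    the environment process.
    """
    try:
        return evaluate_fn(params)
    except Exception as exc:
        logger.warning("evaluation failed: %r", exc)
        return failure_metrics(f"{type(exc).__name__}: {exc}")
```

Keeping the exception type in `failure_reason` preserves debuggability without sacrificing the documented-observation contract.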
### 7. Submit budget reporting is inconsistent

`submit` does not decrement `budget_remaining`, so the terminal observation shows pre-submit budget. Since `submit` is terminal, this is not a serious episode-flow bug, but it is a contract/reporting inconsistency.

Relevant reference:

- [server/environment.py](/Users/suhjungdae/code/fusion-design-lab/server/environment.py#L151)

### 8. Minor maintenance smells

- [server/app.py](/Users/suhjungdae/code/fusion-design-lab/server/app.py#L6) imports `N_FIELD_PERIODS` indirectly via `server.environment` instead of directly from `server.contract`
- [fusion_lab/models.py](/Users/suhjungdae/code/fusion-design-lab/fusion_lab/models.py#L57) and [fusion_lab/models.py](/Users/suhjungdae/code/fusion-design-lab/fusion_lab/models.py#L62) have misleading defaults if used outside the environment

## Findings From The Quoted External Review

### Confirmed

- submit cost is real
- heuristic baseline is stale
- `compare.py` should check failed submits explicitly
- `app.py` re-export dependency is fragile
- observation defaults are mildly misleading

### Partially Confirmed

- submit budget reporting is inconsistent, but it does not allow the agent to exceed the episode budget because submit is terminal

### Not Confirmed

The quoted review's highest-severity claim was an `_update_best` feasibility mismatch between:

- `constraints_satisfied`
- `p1_feasibility <= FEASIBILITY_TOLERANCE`

I did not reproduce that as a current live bug. I sampled 30 low-fidelity evaluations across the current parameter ranges and found zero mismatches between those two criteria.

That issue is only real if upstream breaks the invariant between `GeometricalProblem.is_feasible()` and `compute_feasibility()`. On current `HEAD`, it should not be treated as the top issue.

## Validation Performed

### Compile smoke

Passed:

```bash
uv run python -m py_compile fusion_lab/models.py fusion_lab/client.py server/environment.py server/app.py server/physics.py
```

### Reset-seed sanity check

All three current reset seeds start low-fidelity infeasible at about:

- `p1_feasibility ~= 0.05065`

So the task is not trivially solved at reset.

### Sampled low-fidelity trajectory

For seed `0`, a simple heuristic-like low-fidelity sequence reached feasibility within 5 steps, and the feasibility-crossing step received a strong positive reward (`+3.1054`).

This is a good sign for current `Reward V0` ordering, but it is not a substitute for the fixture/manual-playtest loop the docs require.

### Current workspace state at review time

The workspace was dirty during review:

- modified [training/notebooks/northflank_smoke.py](/Users/suhjungdae/code/fusion-design-lab/training/notebooks/northflank_smoke.py)
- untracked [baselines/measured_sweep.py](/Users/suhjungdae/code/fusion-design-lab/baselines/measured_sweep.py)
- untracked `baselines/sweep_results/`

These were treated as workspace observations, not committed `HEAD` findings.

## Commit-History Update

Recent history is directionally good and mostly coherent:

- `d58c100` made the verifier real by switching to `constellaration`
- `fe3a41d` repaired the parameterization and added explicit VMEC failure semantics
- `88d9b78` fixed submit-time mixed-fidelity reward/reporting
- `eb446cf` fixed the post-failure recovery reward bug
- `6deaccc` cleaned up contract naming and tightened baseline reporting

The current issues are mostly:

- clarity
- bookkeeping
- operational cost
- missing validation artifacts

None of these is a fundamental verifier-wiring failure.

## Current Assessment

### Verifier

Status: mostly implemented correctly, not fully productized

Strengths:

- official `GeometricalProblem` semantics
- boundary-based evaluation
- low-fi/high-fi split
- explicit VMEC failure metrics for common runtime failures

Weaknesses:

- hidden tolerance semantics in diagnostics
- incomplete exception normalization
- high-fidelity best-state tracking ambiguity

### Reward

Status: directionally good, still pre-validation

Strengths:

- scalar and verifier-driven
- feasibility-first
- objective shaping only after feasibility
- explicit failure penalty
- recovery reward no longer relies on failure sentinels
- explicit submit remains better than passive exhaustion

Weaknesses:

- not yet validated by the playtest/fixture loop the repo requires
- some reward/reporting details still depend on unclear high-fi bookkeeping

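The reward strengths listed above imply a particular shape: failure penalty first, feasibility progress while infeasible, objective shaping only once feasible. The sketch below is illustrative only; the signature, constants, and shaping terms are assumptions, not the actual `Reward V0` implementation. It follows the convention from the findings that lower `p1_feasibility` is better, with feasibility at or below the tolerance.

```python
def reward_v0_sketch(feasibility: float,
                     prev_feasibility: float,
                     objective_gain: float,
                     evaluation_failed: bool,
                     tolerance: float = 0.0) -> float:
    """Feasibility-first scalar reward (illustrative constants only).

    - failed evaluations get a flat penalty
    - while infeasible, reward tracks feasibility improvement only
    - once feasible, objective shaping turns on
    - crossing into feasibility earns a one-time bonus
    """
    if evaluation_failed:
        return -1.0
    crossed = prev_feasibility > tolerance and feasibility <= tolerance
    if feasibility > tolerance:
        # Still infeasible: reward reduced violation, ignore the objective.
        return prev_feasibility - feasibility
    shaped = objective_gain
    if crossed:
        shaped += 3.0  # one-time feasibility-crossing bonus (illustrative)
    return shaped
```

A shape like this is consistent with the sampled trajectory's large positive reward at the feasibility-crossing step, but the real constants and terms must come from the repo's own reward code.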
## TODOs

### Highest Priority

1. Fix diagnostics so humans can understand official tolerance-based feasibility.
2. Fix true high-fidelity best-state bookkeeping.
3. Update [baselines/compare.py](/Users/suhjungdae/code/fusion-design-lab/baselines/compare.py) to reject or explicitly flag failed high-fidelity submits.
4. Reduce submit-time high-fidelity reevaluation cost.

### Validation Priority

5. Add tracked `P1` fixtures under [server/data/p1/](/Users/suhjungdae/code/fusion-design-lab/server/data/p1/).
6. Run fixture sanity checks for:
   - good or near-boundary cases
   - clearly bad cases
   - reward ordering
7. Run manual playtesting and record:
   - observation seen
   - action chosen
   - verifier outcome
   - reward returned
   - whether reward matched intuition
8. Either:
   - document the first real `Reward V0 -> V1` pathology/fix
   - or explicitly record that `Reward V0` survived playtesting unchanged

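The reward-ordering part of the fixture sanity checks can be mechanical once fixtures exist. A sketch with entirely hypothetical fixture records (labels, values, and the record shape are illustrative, not the repo's fixture format):

```python
# Hypothetical fixture record: (label, p1_feasibility, reward).
# Lower p1_feasibility means a strictly better (less violating) design.
FIXTURES = [
    ("clearly_bad",   0.05,  -0.4),
    ("near_boundary", 0.011,  0.3),
    ("good",          0.0,    1.2),
]

def check_reward_ordering(fixtures) -> list:
    """Return (worse, better) label pairs where a strictly better design
    earned no more reward than a strictly worse one."""
    violations = []
    for la, fa, ra in fixtures:
        for lb, fb, rb in fixtures:
            if fa < fb and ra <= rb:  # a is strictly better but paid no more
                violations.append((lb, la))
    return violations
```

An empty result means the fixture set respects the feasibility-first ordering; any returned pair names a concrete reward pathology to document as part of the `Reward V0 -> V1` record.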
### Secondary Cleanup

9. Decide whether submit should decrement `budget_remaining` for contract clarity.
10. Import `N_FIELD_PERIODS` in [server/app.py](/Users/suhjungdae/code/fusion-design-lab/server/app.py) directly from `server.contract`.
11. Clean up or intentionally commit the current workspace-only baseline and notebook artifacts.

## Bottom Line

The verifier is no longer the main blocker. The repo has crossed the line from "not wired" to "mostly wired and mostly correct."

The next blocker is proving that the environment is actually a clear, reproducible, human-legible artifact through:

- fixtures
- playtesting
- reward-validation evidence

Until those exist, the verifier and reward should be described as strong `HEAD` implementations with remaining validation debt, not finished hackathon evidence.