Commit `f09f601` (parent: `e815b38`) — docs: add verifier reward review

File added: `docs/VERIFIER_REWARD_REVIEW_2026-03-08.md`
# Verifier And Reward Review

Date: 2026-03-08

## Scope

This note reviews how well the current verifier path and reward function serve the repo's stated goals:

- ship one clear, reproducible OpenEnv environment for budget-constrained stellarator design
- keep the verifier official
- keep the task human-playable
- keep the reward explainable and iteratively improved

Primary review targets:

- [docs/FUSION_DESIGN_LAB_PLAN_V2.md](/Users/suhjungdae/code/fusion-design-lab/docs/FUSION_DESIGN_LAB_PLAN_V2.md)
- [docs/P1_ENV_CONTRACT_V1.md](/Users/suhjungdae/code/fusion-design-lab/docs/P1_ENV_CONTRACT_V1.md)
- [server/physics.py](/Users/suhjungdae/code/fusion-design-lab/server/physics.py)
- [server/environment.py](/Users/suhjungdae/code/fusion-design-lab/server/environment.py)
- recent history from `d58c100` through `6deaccc`
## Executive Summary

The verifier implementation is directionally strong and mostly correct. The major March 7, 2026 commit sequence fixed the important correctness problems:

- replaced the synthetic evaluator with the real `constellaration` / `GeometricalProblem` path
- repaired the blocked 3-knob family by adding explicit triangularity control
- separated low-fidelity `run` truth from high-fidelity `submit` truth
- fixed the post-VMEC-failure recovery reward bug

The reward is also directionally good: `Reward V0` remains simple, mostly verifier-driven, and aligned with the repo docs.

The main gap is no longer basic verifier wiring. It is that the repo still lacks the validation evidence its own docs require before the environment can be called "good":

- no tracked fixtures
- no manual playtest log
- no documented reward pathology/fix, and no explicit "Reward V0 survived playtest" note
## Validated Findings

### 1. Missing validation artifacts are still the biggest repo-level gap

The planning docs explicitly say the environment artifact is the product, not just the code. The current repo state still lacks:

- tracked `P1` fixtures
- manual playtest evidence
- a documented `Reward V0 -> V1` change, or an explicit record that `Reward V0` survived playtesting unchanged

Relevant references:

- [docs/FUSION_DESIGN_LAB_PLAN_V2.md](/Users/suhjungdae/code/fusion-design-lab/docs/FUSION_DESIGN_LAB_PLAN_V2.md#L36)
- [docs/FUSION_DESIGN_LAB_PLAN_V2.md](/Users/suhjungdae/code/fusion-design-lab/docs/FUSION_DESIGN_LAB_PLAN_V2.md#L420)
- [docs/FUSION_DELIVERABLES_MAP.md](/Users/suhjungdae/code/fusion-design-lab/docs/FUSION_DELIVERABLES_MAP.md#L21)
- [server/data/p1/README.md](/Users/suhjungdae/code/fusion-design-lab/server/data/p1/README.md#L15)
### 2. Observation legibility still conflicts with the official tolerance semantics

The env prints hard constraint thresholds in diagnostics, but actual feasibility is decided by `GeometricalProblem.is_feasible()` using the official 1% tolerance.

That means a human can simultaneously see:

- a printed bound like `edge_iota_over_nfp >= 0.3`
- a reported value slightly below `0.3`
- `constraints=SATISFIED`

This is a direct usability problem for a repo whose goal is a legible, human-playable environment.

Relevant references:

- [server/environment.py](/Users/suhjungdae/code/fusion-design-lab/server/environment.py#L291)
- [server/physics.py](/Users/suhjungdae/code/fusion-design-lab/server/physics.py#L114)
- [docs/P1_PARAMETERIZATION_DEEPDIVE.md](/Users/suhjungdae/code/fusion-design-lab/docs/P1_PARAMETERIZATION_DEEPDIVE.md#L202)
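One way to make the tolerance mismatch legible is to print the *effective* bound next to the nominal one. This is a minimal sketch, not the repo's actual diagnostics code: `constraint_line` and its parameters are hypothetical names, and `FEASIBILITY_TOLERANCE = 0.01` is assumed to mirror the official 1% tolerance.

```python
# Assumed to mirror GeometricalProblem's official 1% tolerance.
FEASIBILITY_TOLERANCE = 0.01

def constraint_line(name: str, value: float, lower_bound: float) -> str:
    """Render one diagnostic line showing both the nominal bound and the
    effective bound after the tolerance, so SATISFIED/VIOLATED matches
    what the verifier actually decides."""
    effective = lower_bound * (1.0 - FEASIBILITY_TOLERANCE)
    ok = value >= effective
    return (
        f"{name} = {value:.4f} "
        f"(bound >= {lower_bound}, effective >= {effective:.4f} after 1% tolerance) "
        f"-> {'SATISFIED' if ok else 'VIOLATED'}"
    )
```

With this rendering, a value of `0.298` against a printed bound of `0.3` would read as `SATISFIED` with the effective bound `0.2970` shown, which removes the apparent contradiction.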
### 3. High-fidelity best-state bookkeeping is still not trustworthy

`best_high_fidelity_score` and `best_high_fidelity_feasibility` are not true best high-fidelity values. On submit, the env can overwrite them by reevaluating `best_params`, even if the current submit is the best actual high-fidelity result.

This weakens the clean low-fi/high-fi split the contract calls for.

Relevant references:

- [server/environment.py](/Users/suhjungdae/code/fusion-design-lab/server/environment.py#L368)
- [server/environment.py](/Users/suhjungdae/code/fusion-design-lab/server/environment.py#L449)
- [docs/P1_ENV_CONTRACT_V1.md](/Users/suhjungdae/code/fusion-design-lab/docs/P1_ENV_CONTRACT_V1.md#L163)
### 4. Submit is operationally expensive

One `submit` can trigger up to three high-fidelity evaluations:

1. the current design
2. the initial high-fidelity reference
3. a reevaluation of `best_params`

That makes baseline comparison and manual validation much slower than the task surface implies.

Relevant references:

- [server/environment.py](/Users/suhjungdae/code/fusion-design-lab/server/environment.py#L151)
- [server/environment.py](/Users/suhjungdae/code/fusion-design-lab/server/environment.py#L442)
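The initial high-fidelity reference is fixed per episode, so it is the most obvious candidate to compute once and reuse. A sketch of that memoization, assuming a hypothetical evaluator callable; this is not the environment's actual submit path:

```python
class ReferenceCache:
    """Illustrative once-only cache for the fixed initial reference,
    so a submit pays for at most one new high-fidelity evaluation."""

    def __init__(self, evaluator):
        self._evaluator = evaluator  # hypothetical high-fidelity call
        self._reference = None
        self.calls = 0  # exposed for cost accounting

    def reference(self, initial_params):
        """Evaluate the initial reference on first use, then reuse it."""
        if self._reference is None:
            self.calls += 1
            self._reference = self._evaluator(initial_params)
        return self._reference
```

Combined with the monotone best-state rule from finding 3 (which removes the need to reevaluate `best_params` at all), this would bring submit down from three high-fidelity evaluations to one.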
### 5. Baseline comparison does not reject failed high-fidelity submits

[baselines/compare.py](/Users/suhjungdae/code/fusion-design-lab/baselines/compare.py#L54) only checks that the final step used high-fidelity evaluation; it does not check whether that high-fidelity submit succeeded.

As a result, a failed submit can still count as a terminal high-fi result in the comparison output.
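The missing guard is small. A sketch of a stricter terminal-step check; the step keys (`"fidelity"`, `"vmec_failed"`) are assumptions about the trajectory record shape, not `compare.py`'s actual schema:

```python
def is_valid_terminal_submit(final_step: dict) -> bool:
    """Accept a terminal step only if it was high-fidelity AND the
    underlying evaluation actually succeeded."""
    return (
        final_step.get("fidelity") == "high"
        and not final_step.get("vmec_failed", False)
    )
```

A comparison script using this guard would either drop failed submits or flag them explicitly in its output rather than silently counting them.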
### 6. Verifier failure normalization is incomplete

[server/physics.py](/Users/suhjungdae/code/fusion-design-lab/server/physics.py#L87) converts `RuntimeError` into explicit failure metrics, but other exception types can still crash the env instead of producing the documented failure observations.
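A sketch of broader normalization around the verifier call. `safe_evaluate` and `FAILURE_METRICS` are hypothetical stand-ins for the real physics entry point and its documented failure payload; the point is only that the `except` clause should be wider than `RuntimeError`:

```python
# Hypothetical failure payload mirroring the documented failure observation.
FAILURE_METRICS = {"vmec_failed": True, "p1_feasibility": None, "score": None}

def safe_evaluate(evaluate_boundary, params):
    """Normalize *any* evaluator exception into the documented failure
    metrics instead of letting it crash the environment."""
    try:
        return evaluate_boundary(params)
    except Exception as exc:  # deliberately broader than RuntimeError
        return {**FAILURE_METRICS, "error": f"{type(exc).__name__}: {exc}"}
```

Keeping the exception type and message in the payload preserves debuggability while guaranteeing the env always returns a well-formed observation.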
### 7. Submit budget reporting is inconsistent

`submit` does not decrement `budget_remaining`, so the terminal observation shows the pre-submit budget. Since `submit` is terminal, this is not a serious episode-flow bug, but it is a contract/reporting inconsistency.

Relevant reference:

- [server/environment.py](/Users/suhjungdae/code/fusion-design-lab/server/environment.py#L151)
### 8. Minor maintenance smells

- [server/app.py](/Users/suhjungdae/code/fusion-design-lab/server/app.py#L6) imports `N_FIELD_PERIODS` indirectly via `server.environment` instead of directly from `server.contract`
- [fusion_lab/models.py](/Users/suhjungdae/code/fusion-design-lab/fusion_lab/models.py#L57) and [fusion_lab/models.py](/Users/suhjungdae/code/fusion-design-lab/fusion_lab/models.py#L62) have misleading defaults if used outside the environment
## Findings From The Quoted External Review

### Confirmed

- submit cost is real
- the heuristic baseline is stale
- `compare.py` should check for failed submits explicitly
- the `app.py` re-export dependency is fragile
- observation defaults are mildly misleading

### Partially Confirmed

- submit budget reporting is inconsistent, but it does not let the agent exceed the episode budget, because submit is terminal

### Not Confirmed

The quoted review's highest-severity claim was an `_update_best` feasibility mismatch between:

- `constraints_satisfied`
- `p1_feasibility <= FEASIBILITY_TOLERANCE`

I did not reproduce this as a current live bug. I sampled 30 low-fidelity evaluations across the current parameter ranges and found zero mismatches between the two criteria.

That issue only becomes real if upstream breaks the invariant between `GeometricalProblem.is_feasible()` and `compute_feasibility()`. On the current `HEAD`, it should not be treated as the top issue.
## Validation Performed

### Compile smoke

Passed:

```bash
uv run python -m py_compile fusion_lab/models.py fusion_lab/client.py server/environment.py server/app.py server/physics.py
```
### Reset-seed sanity check

All three current reset seeds start low-fidelity infeasible at about:

- `p1_feasibility ~= 0.05065`

So the task is not trivially solved at reset.

### Sampled low-fidelity trajectory

For seed `0`, a simple heuristic-like low-fidelity sequence reached feasibility within 5 steps, and the feasibility-crossing step received a strong positive reward (`+3.1054`).

This is a good sign for the current `Reward V0` ordering, but it is not a substitute for the fixture/manual-playtest loop the docs require.
### Current workspace state at review time

The workspace was dirty during the review:

- modified [training/notebooks/northflank_smoke.py](/Users/suhjungdae/code/fusion-design-lab/training/notebooks/northflank_smoke.py)
- untracked [baselines/measured_sweep.py](/Users/suhjungdae/code/fusion-design-lab/baselines/measured_sweep.py)
- untracked `baselines/sweep_results/`

These were treated as workspace observations, not committed `HEAD` findings.
## Commit-History Update

Recent history is directionally good and mostly coherent:

- `d58c100` made the verifier real by switching to `constellaration`
- `fe3a41d` repaired the parameterization and added explicit VMEC failure semantics
- `88d9b78` fixed submit-time mixed-fidelity reward/reporting
- `eb446cf` fixed the post-failure recovery reward bug
- `6deaccc` cleaned up contract naming and tightened baseline reporting

The remaining issues are mostly matters of:

- clarity
- bookkeeping
- operational cost
- missing validation artifacts

not a fundamental verifier-wiring failure.
## Current Assessment

### Verifier

Status: mostly implemented correctly, not yet fully productized.

Strengths:

- official `GeometricalProblem` semantics
- boundary-based evaluation
- low-fi/high-fi split
- explicit VMEC failure metrics for common runtime failures

Weaknesses:

- hidden tolerance semantics in diagnostics
- incomplete exception normalization
- ambiguous high-fidelity best-state tracking

### Reward

Status: directionally good, still pre-validation.

Strengths:

- scalar and verifier-driven
- feasibility-first
- objective shaping only after feasibility
- explicit failure penalty
- recovery reward no longer relies on failure sentinels
- explicit submit remains better than passive exhaustion

Weaknesses:

- not yet validated by the playtest/fixture loop the repo requires
- some reward/reporting details still depend on unclear high-fi bookkeeping
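The feasibility-first structure listed above can be sketched as a single piecewise function. This is an illustration of the shape, not the repo's actual `Reward V0` code; all constants (`FAILURE_PENALTY`, the feasibility bonus) and names are assumptions:

```python
FAILURE_PENALTY = -1.0
FEASIBILITY_TOLERANCE = 0.01  # assumed official 1% tolerance

def reward_v0_sketch(vmec_failed: bool, feasibility: float, score: float) -> float:
    """Feasibility-first shaping: penalize failures, drive infeasibility
    toward zero, and only add objective shaping once feasible."""
    if vmec_failed:
        return FAILURE_PENALTY
    if feasibility > FEASIBILITY_TOLERANCE:
        return -feasibility           # infeasible: reward reducing violation
    return 1.0 + score                # feasible: bonus plus objective term
```

The key ordering property is that any feasible design outscores any infeasible one, and failures are strictly worst.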
## TODOs

### Highest Priority

1. Fix diagnostics so humans can understand official tolerance-based feasibility.
2. Track the true high-fidelity best state.
3. Update [baselines/compare.py](/Users/suhjungdae/code/fusion-design-lab/baselines/compare.py) to reject or explicitly flag failed high-fidelity submits.
4. Reduce submit-time high-fidelity reevaluation cost.

### Validation Priority

5. Add tracked `P1` fixtures under [server/data/p1/](/Users/suhjungdae/code/fusion-design-lab/server/data/p1/).
6. Run fixture sanity checks covering:
   - good or near-boundary cases
   - clearly bad cases
   - reward ordering
7. Run manual playtesting and record, per step:
   - the observation seen
   - the action chosen
   - the verifier outcome
   - the reward returned
   - whether the reward matched intuition
8. Either:
   - document the first real `Reward V0 -> V1` pathology/fix, or
   - explicitly record that `Reward V0` survived playtesting unchanged

### Secondary Cleanup

9. Decide whether submit should decrement `budget_remaining` for contract clarity.
10. Import `N_FIELD_PERIODS` in [server/app.py](/Users/suhjungdae/code/fusion-design-lab/server/app.py) directly from `server.contract`.
11. Clean up or intentionally commit the current workspace-only baseline and notebook artifacts.
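The reward-ordering fixture check from the validation items above can be reduced to one invariant. A sketch, assuming hypothetical fixture loaders produce reward lists for "clearly bad" and "good or near-boundary" cases; real fixtures would live under `server/data/p1/` once tracked:

```python
def check_reward_ordering(rewards_bad: list[float], rewards_good: list[float]) -> bool:
    """Reward-ordering invariant: every clearly-bad fixture should score
    strictly below every good or near-boundary fixture."""
    return max(rewards_bad) < min(rewards_good)
```

If this invariant ever fails on tracked fixtures, that failure is exactly the kind of documented `Reward V0 -> V1` pathology item 8 asks for.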
## Bottom Line

The verifier is no longer the main blocker. The repo has crossed the line from "not wired" to "mostly wired and mostly correct."

The next blocker is proving that the environment is actually a clear, reproducible, human-legible artifact, through:

- fixtures
- playtesting
- reward-validation evidence

Until those exist, the verifier and reward should be described as strong `HEAD` implementations with remaining validation debt, not as finished hackathon evidence.