# P1 Environment Contract V1
**Role:** Single source of truth (SSOT) for the live technical contract in the current implementation phase
**Planning dependency:** [`FUSION_DESIGN_LAB_PLAN_V2.md`](./FUSION_DESIGN_LAB_PLAN_V2.md)
**Evidence dependency:** [`P1_PARAMETERIZATION_DEEPDIVE.md`](P1_PARAMETERIZATION_DEEPDIVE.md)
## 1. Scope
This document defines the live technical contract for:
- [`server/physics.py`](../server/physics.py)
- [`fusion_lab/models.py`](../fusion_lab/models.py)
- [`server/environment.py`](../server/environment.py)
- [`server/app.py`](../server/app.py)
If the observation schema, action schema, episode flow, terminal conditions, or reward semantics change, update this file in the same task.
## 2. Design Split
Keep three layers separate:
1. boundary builder
2. official verifier
3. environment
Boundary builder owns:
- the repaired low-dimensional family
- rotating-ellipse seed generation
- explicit triangularity control injection
Official verifier owns:
- boundary in, metrics out
- official `P1` feasibility semantics
- objective direction and score ordering
- low-fidelity live evaluation mode
- optional higher-fidelity offline validation mode
- explicit failure results when VMEC or forward-model evaluation fails
Environment owns:
- reset pool
- discrete actions
- episode budget
- best-state tracking
- reward shaping
## 3. Boundary Family
The historical 3-knob upstream rotating-ellipse family is not the live contract.
The live controllable knobs are:
- `aspect_ratio`
- `elongation`
- `rotational_transform`
- `triangularity_scale`
Rules:
- stay low-dimensional and human-playable
- treat the current family as rotating-ellipse-derived, not plain upstream rotating ellipse
- the coarse measured sweep is now recorded, but reset-seed changes and any budget changes should still wait for paired high-fidelity fixture checks
## 4. Action Contract
`intent` is one of:
- `run`
- `submit`
- `restore_best`
For `run`, the action also includes:
- `parameter`: one of `aspect_ratio | elongation | rotational_transform | triangularity_scale`
- `direction`: `increase | decrease`
- `magnitude`: `small | medium | large`
Constraints:
- keep the discrete interaction style
- do not expose the full Fourier action space as the primary environment
- do not use action complexity to compensate for missing clarity elsewhere
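The action shape above can be sketched as a small validated record. This is a minimal illustration, not the contents of `fusion_lab/models.py`: the class name `DesignAction` is hypothetical, and rejecting `run`-only fields on `submit` and `restore_best` is an assumption the contract does not state.

```python
from dataclasses import dataclass
from typing import Optional

INTENTS = ("run", "submit", "restore_best")
PARAMETERS = (
    "aspect_ratio",
    "elongation",
    "rotational_transform",
    "triangularity_scale",
)
DIRECTIONS = ("increase", "decrease")
MAGNITUDES = ("small", "medium", "large")


@dataclass(frozen=True)
class DesignAction:
    """Hypothetical action record; the field names follow the contract."""

    intent: str
    parameter: Optional[str] = None
    direction: Optional[str] = None
    magnitude: Optional[str] = None

    def __post_init__(self) -> None:
        if self.intent not in INTENTS:
            raise ValueError(f"unknown intent: {self.intent!r}")
        if self.intent == "run":
            # A run action must name one knob, a direction, and a step size.
            if self.parameter not in PARAMETERS:
                raise ValueError(f"unknown parameter: {self.parameter!r}")
            if self.direction not in DIRECTIONS:
                raise ValueError(f"unknown direction: {self.direction!r}")
            if self.magnitude not in MAGNITUDES:
                raise ValueError(f"unknown magnitude: {self.magnitude!r}")
        elif any(v is not None for v in (self.parameter, self.direction, self.magnitude)):
            # Assumption: submit and restore_best carry no run-only fields.
            raise ValueError(f"{self.intent} takes no parameter, direction, or magnitude")
```

For example, `DesignAction("run", "elongation", "decrease", "small")` is accepted, while a `run` on any name outside the four knobs raises `ValueError`, which keeps the full Fourier space out of the primary action surface.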
## 5. Observation Contract
The observation must stay metric-centered and human-readable.
Required fields:
- `max_elongation`
- `aspect_ratio`
- `average_triangularity`
- `edge_iota_over_nfp`
- `aspect_ratio_violation`
- `triangularity_violation`
- `iota_violation`
- `dominant_constraint`
- `p1_feasibility`
- `p1_score`
- `constraints_satisfied`
- `vacuum_well`
- `evaluation_fidelity`
- `evaluation_failed`
- `failure_reason`
- `step_number`
- `budget_remaining`
- `no_progress_steps`
- `best_low_fidelity_score`
- `best_low_fidelity_feasibility`
- `target_spec`
- `diagnostics_text`
- `reward_breakdown`
- `action_monitor`
- `episode_total_reward`
- `trajectory_summary`
Interpretation rules:
- live environment metrics must be labeled as low-fidelity
- best-state reporting should reflect the single live reward surface
- the observation must be understandable without hidden state
- normalized constraint-violation telemetry must follow the official `P1` constraint scales
- the dominant active constraint must be visible so a human can explain repair-phase rewards
- reward telemetry must expose which bonuses, penalties, and shaping terms contributed to the scalar reward
- action telemetry must expose parameter values before and after the action, including clamped, no-op, and repeat-state moves
- anti-stagnation state that can change reward must be visible in structured observation fields, not only free text
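As a rough illustration of the violation telemetry, the sketch below derives `dominant_constraint` and a max-based feasibility value from the three normalized violation fields. The helper name `summarize_violations` and the `"none"` label are hypothetical, and the inputs are assumed to be already normalized and non-negative; the official `P1` scales live in the verifier and are not reproduced here.

```python
from typing import Dict, Tuple


def summarize_violations(violations: Dict[str, float]) -> Tuple[str, float]:
    """Return (dominant_constraint, feasibility) for normalized violations.

    Assumes each value is already normalized by the official scale and is
    non-negative, with 0.0 meaning the constraint is satisfied.
    """
    dominant = max(violations, key=violations.get)
    feasibility = violations[dominant]
    if feasibility <= 0.0:
        # Hypothetical label for the case where no constraint is active.
        return "none", 0.0
    return dominant, feasibility


before = {
    "aspect_ratio_violation": 0.00,
    "triangularity_violation": 0.30,
    "iota_violation": 0.20,
}
# A repair step that only improves the non-dominant iota constraint.
after = dict(before, iota_violation=0.05)

dominant_before, feasibility_before = summarize_violations(before)
dominant_after, feasibility_after = summarize_violations(after)
```

In this example the max-based feasibility is `0.30` both before and after the step, even though the iota violation dropped, which is why the per-constraint fields and the dominant constraint need to be visible alongside the scalar.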
## 6. Episode Flow
1. Reset from one frozen repaired-family seed or a small frozen seed set.
2. Evaluate the initial state with low fidelity and return the first observation.
3. On `run`, perturb one controllable parameter and re-evaluate with low fidelity.
4. On `restore_best`, revert to the best known low-fidelity state, re-evaluate, and consume budget.
5. On `submit`, re-evaluate the current state with low fidelity, consume budget, and end the episode.
6. End the episode on `submit` or budget exhaustion.
Failure semantics:
- failed evaluations still consume budget
- failed evaluations produce visible failure observations
- failed evaluations apply a documented penalty
- the environment must not silently convert failures into success paths
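The flow and failure semantics above can be sketched as a toy loop. This is not `server/environment.py`: the class `EpisodeSketch`, the evaluator signature (returning a feasibility value, or `None` on failure), tracking the best state by feasibility alone, and the `"low_fidelity"` label string are all assumptions for illustration. The budget of 6 is the current value, which the open measurements may change.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

BUDGET = 6  # current value; listed as an open measurement


@dataclass
class EpisodeSketch:
    # Hypothetical evaluator: returns p1_feasibility, or None on failure.
    evaluate: Callable[[Dict[str, float]], Optional[float]]
    params: Dict[str, float]
    budget_remaining: int = BUDGET
    done: bool = False
    best_params: Optional[Dict[str, float]] = None
    best_feasibility: float = float("inf")

    def __post_init__(self) -> None:
        self.params = dict(self.params)  # do not mutate the caller's seed

    def _observe(self) -> Dict[str, object]:
        result = self.evaluate(self.params)
        failed = result is None
        if not failed and result < self.best_feasibility:
            self.best_feasibility = result
            self.best_params = dict(self.params)
        return {
            "p1_feasibility": result,
            "evaluation_failed": failed,  # failures stay visible
            "evaluation_fidelity": "low_fidelity",  # assumed label string
            "budget_remaining": self.budget_remaining,
        }

    def reset(self) -> Dict[str, object]:
        return self._observe()

    def step(self, intent: str, parameter: Optional[str] = None, delta: float = 0.0):
        if self.done:
            raise RuntimeError("episode already ended")
        if intent not in ("run", "restore_best", "submit"):
            raise ValueError(f"unknown intent: {intent!r}")
        if intent == "run" and parameter not in self.params:
            raise ValueError(f"unknown parameter: {parameter!r}")
        # Every action consumes budget, including ones whose evaluation fails.
        self.budget_remaining -= 1
        if intent == "run":
            self.params[parameter] += delta
        elif intent == "restore_best" and self.best_params is not None:
            self.params = dict(self.best_params)
        self.done = intent == "submit" or self.budget_remaining <= 0
        return self._observe()
```

A failed `run` in this sketch still lowers `budget_remaining` and returns `evaluation_failed=True`, and a following `restore_best` returns to the last successfully evaluated best state at the cost of one more step.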
## 7. Terminal Contract
At termination, the environment must provide:
- final best design metrics
- final feasibility status
- total reward
- a short human-readable trajectory summary
- the final reward breakdown and action telemetry for the terminal step
Terminal reporting rules:
- keep submit-time reporting on the same live low-fidelity truth surface as the rest of the episode
- keep any higher-fidelity validation artifacts explicitly outside the live environment observation contract
## 8. Verifier Contract
The verifier of record is `constellaration.problems.GeometricalProblem`.
The implementation must preserve:
- objective direction
- constraint direction
- feasibility semantics
- score ordering
The verifier should stay boundary-based:
- `build_boundary_from_params(...) -> SurfaceRZFourier`
- `evaluate_boundary(boundary, fidelity) -> EvaluationMetrics`
Do not treat parameterization-specific logic as verifier truth.
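One way to keep failures explicit at the verifier boundary is to wrap the forward-model call so that an exception becomes a structured result rather than a crash or a silent default. The sketch below is an assumption-heavy illustration: `EvaluationResult` is a hypothetical stand-in for `EvaluationMetrics`, `forward_model` is a placeholder callable, and the real `evaluate_boundary` in `server/physics.py` may be structured differently.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class EvaluationResult:
    """Hypothetical stand-in for EvaluationMetrics."""

    evaluation_fidelity: str
    evaluation_failed: bool
    failure_reason: str = ""
    metrics: Optional[dict] = None


def evaluate_boundary_safely(
    forward_model: Callable[[Any, str], dict],
    boundary: Any,
    fidelity: str,
) -> EvaluationResult:
    """Run the forward model and report failure as data, not as a crash."""
    try:
        metrics = forward_model(boundary, fidelity)
    except Exception as exc:  # broad on purpose in this sketch
        reason = f"{type(exc).__name__}: {exc}"
        return EvaluationResult(fidelity, True, failure_reason=reason)
    return EvaluationResult(fidelity, False, metrics=metrics)
```

The wrapper takes only a boundary and a fidelity label, so nothing about the four-knob parameterization leaks into the verifier side.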
VMEC preset mapping:
- `run`, `restore_best`, and `submit` use the `low_fidelity` VMEC preset (~0.6 s, tolerant convergence)
- higher-fidelity validation runs outside the live environment loop and uses the `from_boundary_resolution` VMEC preset (~4 s, adaptive convergence matched to the boundary Fourier resolution)
- the `high_fidelity` VMEC preset (minimum 10 modes, strict convergence) is not used because it does not converge on the current `mpol=3, ntor=3` boundaries
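The preset mapping can be written down as a small lookup so that no live intent can select anything other than the low-fidelity preset. The constant and function names here are hypothetical; only the preset strings and intent names come from the contract.

```python
LIVE_INTENTS = ("run", "restore_best", "submit")
LIVE_PRESET = "low_fidelity"
# Used only by offline validation scripts, never inside the environment loop.
OFFLINE_VALIDATION_PRESET = "from_boundary_resolution"


def live_preset_for(intent: str) -> str:
    """Return the VMEC preset name for a live environment intent."""
    if intent not in LIVE_INTENTS:
        raise ValueError(f"unknown intent: {intent!r}")
    # All live intents share one preset, so submit stays on the same
    # low-fidelity truth surface as run and restore_best.
    return LIVE_PRESET
```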
Training and evaluation rule:
- use the live low-fidelity environment contract, including explicit `submit`, as the RL surface
- the standard repository notebook and `training/llm_rollout.py` workflows should stay aligned to that same action and reward contract
- keep higher-fidelity validation in offline scripts, paired fixture checks, and final evidence artifacts
- do not reintroduce a separate high-fidelity submit path into the live environment unless the contract is deliberately redefined
## 9. Reward V2
`Reward V2` keeps the verifier-native structure from `Reward V1` and adds a small amount of
trajectory-aware shaping. `Reward V1` fixed the main coarse-signal pathology from `Reward V0`:
pure `Δ official_feasibility` was too coarse because official feasibility is a max over
normalized constraint violations, so useful repair steps on non-dominant constraints could be
nearly invisible to the reward.
The remaining `Reward V1` pathology was not verifier mismatch. It was short-horizon shaping:
- the agent got no extra signal for setting a new best infeasible point
- progress into the near-feasible range (`p1_feasibility` below `0.02`) had no milestone signal unless it crossed the full feasible boundary
- feasible improvements only saw step-to-step objective deltas, not "new best feasible score" progress
- repeated local loops or three-step stagnation had no explicit penalty beyond normal step cost
Target behavior:
- infeasible to feasible crossing gets a clear positive bonus
- feasible to infeasible regression gets a clear penalty
- when both states are infeasible, reduced official feasibility violation should still help
- on low-fidelity `run` steps, setting a new best infeasible feasibility should help
- entering the near-feasible corridor (`p1_feasibility <= 0.02`) should get a small bounded bonus
- when both states are infeasible, reduced normalized triangularity violation should help the most
- when both states are infeasible, reduced normalized aspect-ratio and edge-iota violations should also help
- when both states are feasible, lower `max_elongation` should help
- on low-fidelity `run` steps, beating the previous best feasible score should help
- larger `run` actions should pay a larger step cost than smaller `run` actions
- `restore_best` should keep a flat non-submit step cost
- repeated local revisits without improvement should pay a small penalty
- three non-improving steps in a row should pay a small stagnation penalty
- `submit` should be better than passive exhaustion when the design is genuinely improved
- recovery after a failed evaluation may receive a modest bounded bonus
Rules:
- keep reward scalar and verifier-driven
- keep the infeasible shaping tied to official normalized constraint violations, not family-name priors
- do not add family-specific reward shaping from `scadena`, `CreativeEngineer`, `Samet`, or `egodos`
- do not use reward complexity to compensate for blocked parameterization, poor seeds, or unclear observations
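A partial sketch of how some of the target behaviors for a `run` step could combine into a scalar with a visible breakdown is shown below. Every weight and bonus value is a placeholder chosen for illustration, not a value from the implementation; the names `Snapshot` and `run_step_reward` are hypothetical. Only the `0.02` corridor threshold and the ordering constraints (triangularity weighted most, larger steps costing more) come from this section. Best-so-far bonuses, revisit and stagnation penalties, `submit`, `restore_best`, and failure recovery are left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Dict

NEAR_FEASIBLE = 0.02  # corridor threshold from the contract

# Placeholder values; the real coefficients are not specified here.
STEP_COST = {"small": 0.01, "medium": 0.02, "large": 0.04}
CROSSING_BONUS = 1.0
REGRESSION_PENALTY = 1.0
NEAR_FEASIBLE_BONUS = 0.1
W_FEASIBILITY = 1.0
W_TRIANGULARITY = 1.0  # weighted most among per-constraint terms
W_ASPECT_RATIO = 0.5
W_IOTA = 0.5
W_ELONGATION = 1.0


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical subset of the observation used by the reward."""

    constraints_satisfied: bool
    p1_feasibility: float
    triangularity_violation: float
    aspect_ratio_violation: float
    iota_violation: float
    max_elongation: float


def run_step_reward(prev: Snapshot, curr: Snapshot, magnitude: str) -> Dict[str, float]:
    """Return a reward breakdown for one run step, with a 'total' entry."""
    terms = {"step_cost": -STEP_COST[magnitude]}
    if curr.constraints_satisfied and not prev.constraints_satisfied:
        terms["feasibility_crossing"] = CROSSING_BONUS
    elif prev.constraints_satisfied and not curr.constraints_satisfied:
        terms["feasibility_regression"] = -REGRESSION_PENALTY
    elif not curr.constraints_satisfied:
        # Both infeasible: reward reduced violations, per constraint as well
        # as through the max-based official feasibility.
        terms["feasibility_delta"] = W_FEASIBILITY * (prev.p1_feasibility - curr.p1_feasibility)
        terms["triangularity_repair"] = W_TRIANGULARITY * (
            prev.triangularity_violation - curr.triangularity_violation
        )
        terms["aspect_ratio_repair"] = W_ASPECT_RATIO * (
            prev.aspect_ratio_violation - curr.aspect_ratio_violation
        )
        terms["iota_repair"] = W_IOTA * (prev.iota_violation - curr.iota_violation)
        if prev.p1_feasibility > NEAR_FEASIBLE >= curr.p1_feasibility:
            terms["near_feasible_bonus"] = NEAR_FEASIBLE_BONUS
    else:
        # Both feasible: lower max_elongation is better.
        terms["elongation_delta"] = W_ELONGATION * (prev.max_elongation - curr.max_elongation)
    terms["total"] = sum(terms.values())
    return terms
```

Returning the named terms alongside `total` mirrors the `reward_breakdown` observation field, so a human can see which term produced the scalar. With these placeholder weights, a step that lowers only the iota violation still earns a positive `iota_repair` term even when `feasibility_delta` is zero.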
## 10. Reset and Fixture Policy
Reset policy:
- start with exact frozen seeds
- keep `n_field_periods = 3`
- prefer a small reproducible seed set
Each seed should be:
- reproducible
- near enough to the feasible boundary to make the budget meaningful
- not already solved
Fixture policy:
- track good, boundary, and clearly bad references
- use fixtures for verifier and reward sanity checks
- do not turn fixture mining into a separate broad project
## 11. Open Measurements
These items remain open until measured on the repaired family:
- exact repaired-family range bounds
- exact `triangularity_scale` deltas
- exact `rotational_transform` bounds
- exact reset seed pool
- whether the budget should stay at 6 or change
## 12. Out of Scope
- porting the old `ai-sci-feasible-designs` harness
- broad Fourier-mode action space as the main environment
- complicated reward shaping before playtest evidence
- a wider task family than the single stellarator environment