P1 Environment Contract V1
Role: live technical contract and single source of truth (SSOT) for the current implementation phase
Planning dependency: FUSION_DESIGN_LAB_PLAN_V2.md
Evidence dependency: P1_PARAMETERIZATION_DEEPDIVE.md
1. Scope
This document defines the live technical contract for the P1 environment: its observation schema, action schema, episode flow, terminal conditions, and reward semantics.
If the observation schema, action schema, episode flow, terminal conditions, or reward semantics change, update this file in the same task.
2. Design Split
Keep three layers separate:
- boundary builder
- official verifier
- environment
Boundary builder owns:
- the repaired low-dimensional family
- rotating-ellipse seed generation
- explicit triangularity control injection
Official verifier owns:
- boundary in, metrics out
- official P1 feasibility semantics
- objective direction and score ordering
- low-fidelity live evaluation mode
- optional higher-fidelity offline validation mode
- explicit failure results when VMEC or forward-model evaluation fails
Environment owns:
- reset pool
- discrete actions
- episode budget
- best-state tracking
- reward shaping
3. Boundary Family
The historical 3-knob upstream rotating-ellipse family is not the live contract.
The live controllable knobs are:
- `aspect_ratio`
- `elongation`
- `rotational_transform`
- `triangularity_scale`
Rules:
- stay low-dimensional and human-playable
- treat the current family as rotating-ellipse-derived, not plain upstream rotating ellipse
- the coarse measured sweep is now recorded, but reset-seed changes and any budget changes should still wait for paired high-fidelity fixture checks
4. Action Contract
`intent` is one of:
- `run`
- `submit`
- `restore_best`
For `run`, the action also includes:
- `parameter`: one of `aspect_ratio` | `elongation` | `rotational_transform` | `triangularity_scale`
- `direction`: `increase` | `decrease`
- `magnitude`: `small` | `medium` | `large`
Constraints:
- keep the discrete interaction style
- do not expose the full Fourier action space as the primary environment
- do not use action complexity to compensate for missing clarity elsewhere
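The action contract above can be sketched as a small typed structure. This is an illustrative shape only, assuming a dataclass representation; the real implementation may encode actions differently.

```python
# Hypothetical sketch of the discrete action schema; names mirror the
# contract, but this dataclass itself is illustrative, not the real code.
from dataclasses import dataclass
from typing import Literal, Optional

Intent = Literal["run", "submit", "restore_best"]
Parameter = Literal["aspect_ratio", "elongation",
                    "rotational_transform", "triangularity_scale"]
Direction = Literal["increase", "decrease"]
Magnitude = Literal["small", "medium", "large"]

@dataclass(frozen=True)
class Action:
    intent: Intent
    # The three fields below are only meaningful when intent == "run".
    parameter: Optional[Parameter] = None
    direction: Optional[Direction] = None
    magnitude: Optional[Magnitude] = None

    def __post_init__(self):
        if self.intent == "run" and None in (self.parameter, self.direction, self.magnitude):
            raise ValueError("run actions must specify parameter, direction, and magnitude")
```

Validating at construction time keeps the discrete interaction style honest: a `run` without its three sub-fields is rejected instead of silently defaulting.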
5. Observation Contract
The observation must stay metric-centered and human-readable.
Required fields:
- `max_elongation`
- `aspect_ratio`
- `average_triangularity`
- `edge_iota_over_nfp`
- `aspect_ratio_violation`
- `triangularity_violation`
- `iota_violation`
- `dominant_constraint`
- `p1_feasibility`
- `p1_score`
- `constraints_satisfied`
- `vacuum_well`
- `evaluation_fidelity`
- `evaluation_failed`
- `failure_reason`
- `step_number`
- `budget_remaining`
- `no_progress_steps`
- `best_low_fidelity_score`
- `best_low_fidelity_feasibility`
- `target_spec`
- `diagnostics_text`
- `reward_breakdown`
- `action_monitor`
- `episode_total_reward`
- `trajectory_summary`
Interpretation rules:
- live environment metrics must be labeled as low-fidelity
- best-state reporting should reflect the single live reward surface
- the observation must be understandable without hidden state
- normalized constraint-violation telemetry must follow the official P1 constraint scales
- the dominant active constraint must be visible so a human can explain repair-phase rewards
- reward telemetry must expose which bonuses, penalties, and shaping terms contributed to the scalar reward
- action telemetry must expose parameter values before and after the action, including clamped, no-op, and repeat-state moves
- anti-stagnation state that can change reward must be visible in structured observation fields, not only free text
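The constraint-related observation fields above can be derived mechanically from the three normalized violations. This helper is a sketch (not from the repo), assuming the Section 9 convention that official feasibility is the max over normalized violations and that a non-positive value means the constraint is satisfied.

```python
# Illustrative helper showing how dominant_constraint, p1_feasibility,
# and constraints_satisfied could be derived from the three normalized
# violations; the function name and dict shape are assumptions.
def constraint_fields(aspect_ratio_violation: float,
                      triangularity_violation: float,
                      iota_violation: float) -> dict:
    violations = {
        "aspect_ratio": aspect_ratio_violation,
        "triangularity": triangularity_violation,
        "iota": iota_violation,
    }
    dominant = max(violations, key=violations.get)
    feasibility = max(violations.values())  # official feasibility = max violation
    return {
        "dominant_constraint": dominant,
        "p1_feasibility": feasibility,
        "constraints_satisfied": feasibility <= 0.0,
        "evaluation_fidelity": "low",  # live metrics are always labeled low-fidelity
    }
```

Exposing the dominant constraint alongside the max keeps repair-phase rewards explainable without hidden state.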
6. Episode Flow
- Reset from one frozen repaired-family seed or a small frozen seed set.
- Evaluate the initial state with low fidelity and return the first observation.
- On `run`, perturb one controllable parameter and re-evaluate with low fidelity.
- On `restore_best`, revert to the best known low-fidelity state, re-evaluate, and consume budget.
- On `submit`, re-evaluate the current state with low fidelity, consume budget, and end the episode.
- End the episode on `submit` or budget exhaustion.
Failure semantics:
- failed evaluations still consume budget
- failed evaluations produce visible failure observations
- failed evaluations apply a documented penalty
- the environment must not silently convert failures into success paths
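The episode flow and failure semantics above can be condensed into a minimal step loop. This is a sketch under stated assumptions: the evaluator signature, the step sizes, and the lower-is-better score direction are all placeholders, not the real implementation.

```python
# Minimal episode-flow sketch of Section 6; all internals (step sizes,
# evaluator, score direction) are hypothetical stand-ins.
class EpisodeSketch:
    def __init__(self, seed_state, budget=6, evaluate=None):
        self.state = dict(seed_state)
        self.budget = budget
        self.evaluate = evaluate  # low-fidelity evaluator: state -> (score, failed)
        self.best_state, self.best_score = None, float("inf")
        self.done = False

    def step(self, action):
        self.budget -= 1  # every step, including failed evaluations, consumes budget
        if action["intent"] == "run":
            self.state[action["parameter"]] += (
                {"increase": 1, "decrease": -1}[action["direction"]]
                * {"small": 0.01, "medium": 0.05, "large": 0.1}[action["magnitude"]]
            )
        elif action["intent"] == "restore_best" and self.best_state is not None:
            self.state = dict(self.best_state)
        score, failed = self.evaluate(self.state)
        # failures stay visible and never update the best state
        if not failed and score < self.best_score:  # lower is better (assumption)
            self.best_state, self.best_score = dict(self.state), score
        if action["intent"] == "submit" or self.budget <= 0:
            self.done = True
        return score, failed, self.done
```

Note that failures still burn budget and return a visible `failed` flag, so nothing is silently converted into a success path.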
7. Terminal Contract
At termination, the environment must provide:
- final best design metrics
- final feasibility status
- total reward
- a short human-readable trajectory summary
- the final reward breakdown and action telemetry for the terminal step
Terminal reporting rules:
- keep submit-time reporting on the same live low-fidelity truth surface as the rest of the episode
- keep any higher-fidelity validation artifacts explicitly outside the live environment observation contract
8. Verifier Contract
The verifier of record is `constellaration.problems.GeometricalProblem`.
The implementation must preserve:
- objective direction
- constraint direction
- feasibility semantics
- score ordering
The verifier should stay boundary-based:
- `build_boundary_from_params(...) -> SurfaceRZFourier`
- `evaluate_boundary(boundary, fidelity) -> EvaluationMetrics`
Do not treat parameterization-specific logic as verifier truth.
VMEC preset mapping:
- `run`, `restore_best`, and `submit` use the `low_fidelity` VMEC preset (~0.6s, tolerant convergence)
- higher-fidelity validation uses the `from_boundary_resolution` VMEC preset (~4s, adaptive convergence matching boundary Fourier resolution) outside the live environment loop
- the `high_fidelity` VMEC preset (minimum 10 modes, strict convergence) is not used because it does not converge on the current `mpol=3, ntor=3` boundaries
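The preset routing above is simple enough to pin down in a dispatch function. The preset names match the contract, but this function itself is illustrative and not part of the constellaration API.

```python
# Sketch of the VMEC preset routing implied by the mapping above;
# the function and the "offline_validation" context name are assumptions.
def vmec_preset(context: str) -> str:
    live_intents = {"run", "restore_best", "submit"}
    if context in live_intents:
        return "low_fidelity"  # ~0.6s, tolerant convergence
    if context == "offline_validation":
        # ~4s, convergence adapted to the boundary Fourier resolution
        return "from_boundary_resolution"
    # "high_fidelity" is deliberately never returned: it does not
    # converge on the current mpol=3, ntor=3 boundaries.
    raise ValueError(f"unknown evaluation context: {context}")
```

Centralizing the mapping makes it hard for a stray code path to reintroduce a high-fidelity submit into the live loop.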
Training and evaluation rule:
- use the live low-fidelity environment contract, including explicit `submit`, as the RL surface
- the standard repository notebook and `training/llm_rollout.py` workflows should stay aligned to that same action and reward contract
- keep higher-fidelity validation in offline scripts, paired fixture checks, and final evidence artifacts
- do not reintroduce a separate high-fidelity submit path into the live environment unless the contract is deliberately redefined
9. Reward V2
Reward V2 keeps the verifier-native structure from Reward V1 and adds a small amount of
trajectory-aware shaping. Reward V1 fixed the main coarse-signal pathology from Reward V0:
pure Δ official_feasibility was too coarse because official feasibility is a max over
normalized constraint violations, so useful repair steps on non-dominant constraints could be
nearly invisible to the reward.
The remaining Reward V1 pathology was not verifier mismatch. It was short-horizon shaping:
- the agent got no extra signal for setting a new best infeasible point
- near-feasible progress below `0.02` had no milestone signal unless it crossed the full feasible boundary
- feasible improvements only saw step-to-step objective deltas, not "new best feasible score" progress
- repeated local loops or three-step stagnation had no explicit penalty beyond normal step cost
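The Reward V0 coarseness described above is easy to demonstrate numerically: because official feasibility is a max over normalized violations, repairing a non-dominant constraint produces exactly zero delta. The numbers below are made up for illustration.

```python
# Numeric illustration of the Reward V0 pathology: a pure delta on
# max-over-violations feasibility is blind to non-dominant repairs.
def official_feasibility(violations):
    return max(violations)

before = [0.40, 0.10, 0.25]  # first entry is the dominant constraint
after  = [0.40, 0.02, 0.05]  # genuine repair progress on the other two

delta = official_feasibility(before) - official_feasibility(after)
assert delta == 0.0  # the pure-delta reward sees no progress at all
```

This is why Reward V1 and V2 add per-constraint and milestone signals instead of relying on the max alone.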
Target behavior:
- infeasible to feasible crossing gets a clear positive bonus
- feasible to infeasible regression gets a clear penalty
- when both states are infeasible, reduced official feasibility violation should still help
- on low-fidelity `run` steps, setting a new best infeasible feasibility should help
- entering the near-feasible corridor around `p1_feasibility <= 0.02` should get a small bounded bonus
- when both states are infeasible, reduced normalized triangularity violation should help the most
- when both states are infeasible, reduced normalized aspect-ratio and edge-iota violations should also help
- when both states are feasible, lower `max_elongation` should help
- on low-fidelity `run` steps, beating the previous best feasible score should help
- larger `run` actions should pay a larger step cost than smaller `run` actions
- `restore_best` should keep a flat non-submit step cost
- repeated local revisits without improvement should pay a small penalty
- three non-improving steps in a row should pay a small stagnation penalty
- `submit` should be better than passive exhaustion when the design is genuinely improved
- recovery after a failed evaluation may receive a modest bounded bonus
Rules:
- keep reward scalar and verifier-driven
- keep the infeasible shaping tied to official normalized constraint violations, not family-name priors
- do not add family-specific reward shaping from `scadena`, `CreativeEngineer`, `Samet`, or `egodos`
- do not use reward complexity to compensate for blocked parameterization, poor seeds, or unclear observations
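A few of the Reward V2 terms above can be sketched as a scalar-with-breakdown computation. All coefficients below are placeholders, the sign convention (feasibility > 0 means infeasible) is an assumption, and several listed terms (per-constraint weighting, revisit penalty, magnitude-scaled step cost) are omitted for brevity.

```python
# Hedged sketch combining a subset of the Section 9 terms into a scalar
# reward plus the reward_breakdown telemetry required by Section 5.
# Coefficients are illustrative, not measured contract values.
def reward_v2_sketch(prev_feas, new_feas, new_best_infeasible,
                     crossed_corridor, step_cost, stagnation_steps):
    breakdown = {}
    if prev_feas > 0.0 and new_feas <= 0.0:
        breakdown["feasible_crossing_bonus"] = 1.0
    if prev_feas <= 0.0 and new_feas > 0.0:
        breakdown["infeasible_regression_penalty"] = -1.0
    if prev_feas > 0.0 and new_feas > 0.0:
        # both infeasible: reduced official violation still helps
        breakdown["violation_reduction"] = max(prev_feas - new_feas, 0.0)
    if new_best_infeasible:
        breakdown["new_best_infeasible_bonus"] = 0.1
    if crossed_corridor:  # entered p1_feasibility <= 0.02 corridor
        breakdown["near_feasible_bonus"] = 0.05
    if stagnation_steps >= 3:
        breakdown["stagnation_penalty"] = -0.1
    breakdown["step_cost"] = -step_cost
    return sum(breakdown.values()), breakdown
```

Returning the breakdown dict alongside the scalar satisfies the observation rule that every bonus, penalty, and shaping term contributing to the reward must be visible.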
10. Reset and Fixture Policy
Reset policy:
- start with exact frozen seeds
- keep `n_field_periods = 3`
- prefer a small reproducible seed set
Each seed should be:
- reproducible
- near enough to the feasible boundary to make the budget meaningful
- not already solved
Fixture policy:
- track good, boundary, and clearly bad references
- use fixtures for verifier and reward sanity checks
- do not turn fixture mining into a separate broad project
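The reset and fixture rules above can be captured in a small frozen-seed record plus an ordering sanity check. Everything here is hypothetical (record shape, seed values, the lower-is-better score assumption) except `n_field_periods = 3`, which comes from the reset policy.

```python
# Illustrative frozen-seed record and fixture ordering check; seed values
# and the score direction are assumptions, n_field_periods = 3 is contract.
from dataclasses import dataclass

@dataclass(frozen=True)
class FrozenSeed:
    seed_id: str
    n_field_periods: int
    # (aspect_ratio, elongation, rotational_transform, triangularity_scale)
    params: tuple

SEED_POOL = (
    FrozenSeed("seed_a", 3, (6.0, 1.8, 0.3, 0.1)),  # placeholder values
)

def fixture_ordering_check(good_score, boundary_score, bad_score):
    # good / boundary / clearly-bad fixtures must preserve score ordering
    # (assuming lower score is better on this problem)
    return good_score < boundary_score < bad_score
```

Running the ordering check against the three fixture classes gives a cheap regression test on verifier and reward direction without turning fixture mining into a broad project.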
11. Open Measurements
These items remain open until measured on the repaired family:
- exact repaired-family range bounds
- exact `triangularity_scale` deltas
- exact `rotational_transform` bounds
- exact reset seed pool
- whether the budget should stay at 6 or change
12. Out of Scope
- porting the old `ai-sci-feasible-designs` harness
- broad Fourier-mode action space as the main environment
- complicated reward shaping before playtest evidence
- a wider task family than the single stellarator environment