
P1 Environment Contract V1

Role: Live technical contract (SSOT) for the current implementation phase
Planning dependency: FUSION_DESIGN_LAB_PLAN_V2.md
Evidence dependency: P1_PARAMETERIZATION_DEEPDIVE.md

1. Scope

This document defines the live technical contract for the single stellarator environment: its observation schema, action schema, episode flow, terminal conditions, and reward semantics.

If any of these change, update this file in the same task.

2. Design Split

Keep three layers separate:

  1. boundary builder
  2. official verifier
  3. environment

Boundary builder owns:

  • the repaired low-dimensional family
  • rotating-ellipse seed generation
  • explicit triangularity control injection

Official verifier owns:

  • boundary in, metrics out
  • official P1 feasibility semantics
  • objective direction and score ordering
  • low-fidelity live evaluation mode
  • optional higher-fidelity offline validation mode
  • explicit failure results when VMEC or forward-model evaluation fails

Environment owns:

  • reset pool
  • discrete actions
  • episode budget
  • best-state tracking
  • reward shaping

3. Boundary Family

The historical 3-knob upstream rotating-ellipse family is not the live contract.

The live controllable knobs are:

  • aspect_ratio
  • elongation
  • rotational_transform
  • triangularity_scale

Rules:

  • stay low-dimensional and human-playable
  • treat the current family as rotating-ellipse-derived, not plain upstream rotating ellipse
  • the coarse measured sweep is now recorded, but reset-seed changes and any budget changes should still wait for paired high-fidelity fixture checks

4. Action Contract

Every action carries an intent, one of:

  • run
  • submit
  • restore_best

For run, the action also includes:

  • parameter: one of aspect_ratio | elongation | rotational_transform | triangularity_scale
  • direction: increase | decrease
  • magnitude: small | medium | large

Constraints:

  • keep the discrete interaction style
  • do not expose the full Fourier action space as the primary environment
  • do not use action complexity to compensate for missing clarity elsewhere
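The discrete action contract above can be sketched as a small validation helper. The field names and allowed values follow this contract; the `validate_action` function itself is illustrative, not repository code.

```python
# Hypothetical sketch of the discrete P1 action contract.
# Field names and value sets follow the contract; the validator is illustrative.
INTENTS = {"run", "submit", "restore_best"}
PARAMETERS = {"aspect_ratio", "elongation", "rotational_transform", "triangularity_scale"}
DIRECTIONS = {"increase", "decrease"}
MAGNITUDES = {"small", "medium", "large"}

def validate_action(action: dict) -> bool:
    """Return True if the dict is a well-formed P1 action."""
    if action.get("intent") not in INTENTS:
        return False
    if action["intent"] != "run":
        # submit and restore_best carry no extra fields
        return True
    return (
        action.get("parameter") in PARAMETERS
        and action.get("direction") in DIRECTIONS
        and action.get("magnitude") in MAGNITUDES
    )
```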

5. Observation Contract

The observation must stay metric-centered and human-readable.

Required fields:

  • max_elongation
  • aspect_ratio
  • average_triangularity
  • edge_iota_over_nfp
  • aspect_ratio_violation
  • triangularity_violation
  • iota_violation
  • dominant_constraint
  • p1_feasibility
  • p1_score
  • constraints_satisfied
  • vacuum_well
  • evaluation_fidelity
  • evaluation_failed
  • failure_reason
  • step_number
  • budget_remaining
  • no_progress_steps
  • best_low_fidelity_score
  • best_low_fidelity_feasibility
  • target_spec
  • diagnostics_text
  • reward_breakdown
  • action_monitor
  • episode_total_reward
  • trajectory_summary

Interpretation rules:

  • live environment metrics must be labeled as low-fidelity
  • best-state reporting should reflect the single live reward surface
  • the observation must be understandable without hidden state
  • normalized constraint-violation telemetry must follow the official P1 constraint scales
  • the dominant active constraint must be visible so a human can explain repair-phase rewards
  • reward telemetry must expose which bonuses, penalties, and shaping terms contributed to the scalar reward
  • action telemetry must expose parameter values before and after the action, including clamped, no-op, and repeat-state moves
  • anti-stagnation state that can change reward must be visible in structured observation fields, not only free text
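As a minimal sketch, the required fields above can be typed as a `TypedDict`. Only a subset of fields is shown, and the example values are placeholders; `P1Observation` is a hypothetical name, not the repository's own type.

```python
from typing import Optional, TypedDict

class P1Observation(TypedDict, total=False):
    """Illustrative subset of the required observation fields."""
    p1_feasibility: float        # official max normalized constraint violation
    constraints_satisfied: bool
    dominant_constraint: str     # which violation currently sets the max
    evaluation_fidelity: str     # always low-fidelity in the live loop
    evaluation_failed: bool
    failure_reason: Optional[str]
    budget_remaining: int
    reward_breakdown: dict       # per-term bonuses, penalties, shaping
    action_monitor: dict         # parameter values before/after, clamping

# Example observation (placeholder values, not measured data):
obs: P1Observation = {
    "p1_feasibility": 0.05,
    "constraints_satisfied": False,
    "dominant_constraint": "triangularity_violation",
    "evaluation_fidelity": "low_fidelity",
    "budget_remaining": 6,
}
```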

6. Episode Flow

  1. Reset from one frozen repaired-family seed or a small frozen seed set.
  2. Evaluate the initial state with low fidelity and return the first observation.
  3. On run, perturb one controllable parameter and re-evaluate with low fidelity.
  4. On restore_best, revert to the best known low-fidelity state, re-evaluate, and consume budget.
  5. On submit, re-evaluate the current state with low fidelity, consume budget, and end the episode.
  6. If the budget is exhausted before submit, end the episode there.

Failure semantics:

  • failed evaluations still consume budget
  • failed evaluations produce visible failure observations
  • failed evaluations apply a documented penalty
  • the environment must not silently convert failures into success paths
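The episode flow and failure semantics above can be sketched as one step function. Everything here is illustrative: `perturb`, `step`, and the step sizes are hypothetical stand-ins, and `evaluate_low_fidelity` is a placeholder for the official verifier call.

```python
# Hedged sketch of the live episode step (illustrative names and step sizes).
STEP_SIZES = {"small": 0.01, "medium": 0.05, "large": 0.1}

def perturb(params: dict, action: dict) -> dict:
    """Apply one discrete run action to the controllable parameters."""
    sign = 1.0 if action["direction"] == "increase" else -1.0
    new = dict(params)
    new[action["parameter"]] += sign * STEP_SIZES[action["magnitude"]]
    return new

def step(state: dict, action: dict, evaluate_low_fidelity):
    """One live step. Every evaluation, including failures, consumes budget."""
    state["budget_remaining"] -= 1
    if action["intent"] == "restore_best":
        state["current"] = state["best"]
    elif action["intent"] == "run":
        state["current"] = perturb(state["current"], action)
    try:
        state["metrics"] = evaluate_low_fidelity(state["current"])
        state["evaluation_failed"] = False
    except RuntimeError as exc:  # e.g. VMEC non-convergence
        # Failures stay visible; no silent conversion into a success path.
        state["evaluation_failed"] = True
        state["failure_reason"] = str(exc)
    done = action["intent"] == "submit" or state["budget_remaining"] <= 0
    return state, done
```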

7. Terminal Contract

At termination, the environment must provide:

  • final best design metrics
  • final feasibility status
  • total reward
  • a short human-readable trajectory summary
  • the final reward breakdown and action telemetry for the terminal step

Terminal reporting rules:

  • keep submit-time reporting on the same live low-fidelity truth surface as the rest of the episode
  • keep any higher-fidelity validation artifacts explicitly outside the live environment observation contract
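One way to read the terminal contract is as a single payload assembled from tracked state. The `terminal_report` helper and the field values below are hypothetical; only the field names come from this contract.

```python
# Illustrative terminal payload on the live low-fidelity truth surface.
def terminal_report(state: dict) -> dict:
    """Assemble the required terminal fields from tracked episode state."""
    return {
        "best_metrics": state["best_metrics"],
        "best_feasibility": state["best_low_fidelity_feasibility"],
        "total_reward": state["episode_total_reward"],
        "trajectory_summary": state["trajectory_summary"],
        "reward_breakdown": state["reward_breakdown"],
        "action_monitor": state["action_monitor"],
    }
```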

8. Verifier Contract

The verifier of record is constellaration.problems.GeometricalProblem.

The implementation must preserve:

  • objective direction
  • constraint direction
  • feasibility semantics
  • score ordering

The verifier should stay boundary-based:

  • build_boundary_from_params(...) -> SurfaceRZFourier
  • evaluate_boundary(boundary, fidelity) -> EvaluationMetrics

Do not treat parameterization-specific logic as verifier truth.
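A minimal sketch of that boundary-based seam, with the two contract signatures. `SurfaceRZFourier` and `EvaluationMetrics` are stubbed here so the interface shape is testable without VMEC; the real types live in the verifier stack, and the returned numbers are placeholders, not measurements.

```python
# Sketch of the boundary-based verifier seam (stubbed types, placeholder values).
from dataclasses import dataclass, field

@dataclass
class SurfaceRZFourier:  # stand-in for the real Fourier surface type
    nfp: int = 3
    coeffs: dict = field(default_factory=dict)

@dataclass
class EvaluationMetrics:  # stand-in for the verifier's metrics bundle
    max_elongation: float
    p1_feasibility: float
    feasible: bool

def build_boundary_from_params(params: dict) -> SurfaceRZFourier:
    """Parameterization-specific: maps the 4 live knobs to a boundary."""
    return SurfaceRZFourier(nfp=3, coeffs=dict(params))

def evaluate_boundary(boundary: SurfaceRZFourier, fidelity: str) -> EvaluationMetrics:
    """Verifier-owned: boundary in, metrics out (placeholder numbers here)."""
    assert fidelity in {"low_fidelity", "from_boundary_resolution"}
    return EvaluationMetrics(max_elongation=5.0, p1_feasibility=0.03, feasible=False)
```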

VMEC preset mapping:

  • run, restore_best, and submit use the low_fidelity VMEC preset (~0.6s, tolerant convergence)
  • higher-fidelity validation uses the from_boundary_resolution VMEC preset (~4s, adaptive convergence matching boundary Fourier resolution) outside the live environment loop
  • the high_fidelity VMEC preset (minimum 10 modes, strict convergence) is not used because it does not converge on the current mpol=3, ntor=3 boundaries
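The preset mapping above reduces to a small table. The dict below restates the rules in this contract; the approximate timings are the figures quoted here, and `VMEC_PRESET` itself is an illustrative name.

```python
# Illustrative intent -> VMEC preset mapping, restating the rules above.
VMEC_PRESET = {
    "run": "low_fidelity",             # ~0.6 s, tolerant convergence
    "restore_best": "low_fidelity",
    "submit": "low_fidelity",
    "offline_validation": "from_boundary_resolution",  # ~4 s, adaptive
}
# high_fidelity is deliberately absent: it requires at least 10 modes and
# does not converge on the current mpol=3, ntor=3 boundaries.
```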

Training and evaluation rule:

  • use the live low-fidelity environment contract, including explicit submit, as the RL surface
  • the standard repository notebook and training/llm_rollout.py workflows should stay aligned to that same action and reward contract
  • keep higher-fidelity validation in offline scripts, paired fixture checks, and final evidence artifacts
  • do not reintroduce a separate high-fidelity submit path into the live environment unless the contract is deliberately redefined

9. Reward V2

Reward V2 keeps the verifier-native structure from Reward V1 and adds a small amount of trajectory-aware shaping. Reward V1 fixed the main coarse-signal pathology from Reward V0: pure Δ official_feasibility was too coarse because official feasibility is a max over normalized constraint violations, so useful repair steps on non-dominant constraints could be nearly invisible to the reward.
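The coarse-signal pathology is easy to see in a worked example, since official feasibility is a max over normalized constraint violations. The numbers below are made up for illustration.

```python
# Worked example of the Reward V0 pathology: a genuine repair of a
# non-dominant constraint leaves delta-feasibility at exactly zero.
def official_feasibility(violations: dict) -> float:
    """Official P1 semantics: max over normalized constraint violations."""
    return max(violations.values())

before = {"aspect_ratio": 0.10, "triangularity": 0.40, "iota": 0.05}
after  = {"aspect_ratio": 0.02, "triangularity": 0.40, "iota": 0.05}

delta = official_feasibility(before) - official_feasibility(after)
# aspect_ratio improved 0.10 -> 0.02, yet delta == 0.0: the repair is
# invisible to a pure delta-feasibility reward.
```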

The remaining Reward V1 pathology was not verifier mismatch. It was short-horizon shaping:

  • the agent got no extra signal for setting a new best infeasible point
  • near-feasible progress below 0.02 had no milestone signal unless it crossed the full feasible boundary
  • feasible improvements only saw step-to-step objective deltas, not "new best feasible score" progress
  • repeated local loops or three-step stagnation had no explicit penalty beyond normal step cost

Target behavior:

  • infeasible to feasible crossing gets a clear positive bonus
  • feasible to infeasible regression gets a clear penalty
  • when both states are infeasible, reduced official feasibility violation should still help
  • on low-fidelity run steps, setting a new best infeasible feasibility should help
  • entering the near-feasible corridor around p1_feasibility <= 0.02 should get a small bounded bonus
  • when both states are infeasible, reduced normalized triangularity violation should help the most
  • when both states are infeasible, reduced normalized aspect-ratio and edge-iota violations should also help
  • when both states are feasible, lower max_elongation should help
  • on low-fidelity run steps, beating the previous best feasible score should help
  • larger run actions should pay a larger step cost than smaller run actions
  • restore_best should keep a flat non-submit step cost
  • repeated local revisits without improvement should pay a small penalty
  • three non-improving steps in a row should pay a small stagnation penalty
  • submit should be better than passive exhaustion when the design is genuinely improved
  • recovery after a failed evaluation may receive a modest bounded bonus
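A few of the target behaviors above can be sketched as one scalar reward function. This is a hedged sketch only: the coefficients are placeholders, not tuned repository values, and terms such as best-state bonuses, stagnation penalties, and the feasible-objective delta are omitted.

```python
# Hedged sketch of a few Reward V2 terms (illustrative coefficients only).
FEASIBLE_CROSS_BONUS = 1.0         # infeasible -> feasible crossing
INFEASIBLE_REGRESS_PENALTY = -1.0  # feasible -> infeasible regression
NEAR_FEASIBLE_BONUS = 0.1          # entering the p1_feasibility <= 0.02 corridor
STEP_COST = {"small": -0.01, "medium": -0.02, "large": -0.03}

def reward_v2(prev, curr, magnitude: str) -> float:
    """prev/curr are (feasible: bool, p1_feasibility: float) pairs."""
    r = STEP_COST[magnitude]  # larger run actions pay a larger step cost
    if not prev[0] and curr[0]:
        r += FEASIBLE_CROSS_BONUS
    elif prev[0] and not curr[0]:
        r += INFEASIBLE_REGRESS_PENALTY
    elif not prev[0] and not curr[0]:
        r += prev[1] - curr[1]  # reduced official violation still helps
        if curr[1] <= 0.02 < prev[1]:
            r += NEAR_FEASIBLE_BONUS
    return r
```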

Rules:

  • keep reward scalar and verifier-driven
  • keep the infeasible shaping tied to official normalized constraint violations, not family-name priors
  • do not add family-specific reward shaping from scadena, CreativeEngineer, Samet, or egodos
  • do not use reward complexity to compensate for blocked parameterization, poor seeds, or unclear observations

10. Reset and Fixture Policy

Reset policy:

  • start with exact frozen seeds
  • keep n_field_periods = 3
  • prefer a small reproducible seed set

Each seed should be:

  • reproducible
  • near enough to the feasible boundary to make the budget meaningful
  • not already solved

Fixture policy:

  • track good, boundary, and clearly bad references
  • use fixtures for verifier and reward sanity checks
  • do not turn fixture mining into a separate broad project

11. Open Measurements

These items remain open until measured on the repaired family:

  • exact repaired-family range bounds
  • exact triangularity_scale deltas
  • exact rotational_transform bounds
  • exact reset seed pool
  • whether the budget should stay at 6 or change

12. Out of Scope

  • porting the old ai-sci-feasible-designs harness
  • broad Fourier-mode action space as the main environment
  • complicated reward shaping before playtest evidence
  • a wider task family than the single stellarator environment