
P1 Environment Contract V1

Role: Live technical contract (SSOT) for the current implementation phase
Planning dependency: FUSION_DESIGN_LAB_PLAN_V2.md
Evidence dependency: P1_PARAMETERIZATION_DEEPDIVE.md

1. Scope

This document defines the live technical contract for the single stellarator environment: its observation schema, action schema, episode flow, terminal conditions, and reward semantics.

If any of these change, update this file in the same task.

2. Design Split

Keep three layers separate:

  1. boundary builder
  2. official verifier
  3. environment

Boundary builder owns:

  • the repaired low-dimensional family
  • rotating-ellipse seed generation
  • explicit triangularity control injection

Official verifier owns:

  • boundary in, metrics out
  • official P1 feasibility semantics
  • objective direction and score ordering
  • low-fidelity live evaluation mode
  • optional higher-fidelity offline validation mode
  • explicit failure results when VMEC or forward-model evaluation fails

Environment owns:

  • reset pool
  • discrete actions
  • episode budget
  • best-state tracking
  • reward shaping

3. Boundary Family

The historical 3-knob upstream rotating-ellipse family is not the live contract.

The live controllable knobs are:

  • aspect_ratio
  • elongation
  • rotational_transform
  • triangularity_scale

Rules:

  • stay low-dimensional and human-playable
  • treat the current family as rotating-ellipse-derived, not plain upstream rotating ellipse
  • the coarse measured sweep is now recorded, but reset-seed changes and any budget changes should still wait for paired high-fidelity fixture checks

4. Action Contract

Every action carries an intent, one of:

  • run
  • submit
  • restore_best

For run, the action also includes:

  • parameter: one of aspect_ratio | elongation | rotational_transform | triangularity_scale
  • direction: increase | decrease
  • magnitude: small | medium | large

Constraints:

  • keep the discrete interaction style
  • do not expose the full Fourier action space as the primary environment
  • do not use action complexity to compensate for missing clarity elsewhere
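The discrete action contract above can be sketched as a small validation helper. The field names and allowed values follow this contract; the `validate_action` function itself is illustrative, not repository code.

```python
# Hypothetical sketch of the discrete P1 action contract.
# Field names and value sets follow the contract; the validator is illustrative.
INTENTS = {"run", "submit", "restore_best"}
PARAMETERS = {"aspect_ratio", "elongation", "rotational_transform", "triangularity_scale"}
DIRECTIONS = {"increase", "decrease"}
MAGNITUDES = {"small", "medium", "large"}

def validate_action(action: dict) -> bool:
    """Return True if the dict is a well-formed P1 action."""
    if action.get("intent") not in INTENTS:
        return False
    if action["intent"] != "run":
        # submit and restore_best carry no extra fields
        return True
    return (
        action.get("parameter") in PARAMETERS
        and action.get("direction") in DIRECTIONS
        and action.get("magnitude") in MAGNITUDES
    )
```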

5. Observation Contract

The observation must stay metric-centered and human-readable.

Required fields:

  • max_elongation
  • aspect_ratio
  • average_triangularity
  • edge_iota_over_nfp
  • aspect_ratio_violation
  • triangularity_violation
  • iota_violation
  • dominant_constraint
  • p1_feasibility
  • p1_score
  • constraints_satisfied
  • vacuum_well
  • evaluation_fidelity
  • evaluation_failed
  • failure_reason
  • step_number
  • budget_remaining
  • no_progress_steps
  • best_low_fidelity_score
  • best_low_fidelity_feasibility
  • target_spec
  • diagnostics_text
  • reward_breakdown
  • action_monitor
  • episode_total_reward
  • trajectory_summary

Interpretation rules:

  • live environment metrics must be labeled as low-fidelity
  • best-state reporting should reflect the single live reward surface
  • the observation must be understandable without hidden state
  • normalized constraint-violation telemetry must follow the official P1 constraint scales
  • the dominant active constraint must be visible so a human can explain repair-phase rewards
  • reward telemetry must expose which bonuses, penalties, and shaping terms contributed to the scalar reward
  • action telemetry must expose parameter values before and after the action, including clamped, no-op, and repeat-state moves
  • anti-stagnation state that can change reward must be visible in structured observation fields, not only free text
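As a minimal sketch, the required fields above can be typed as a `TypedDict`. Only a subset of fields is shown, and the example values are placeholders; `P1Observation` is a hypothetical name, not the repository's own type.

```python
from typing import Optional, TypedDict

class P1Observation(TypedDict, total=False):
    """Illustrative subset of the required observation fields."""
    p1_feasibility: float        # official max normalized constraint violation
    constraints_satisfied: bool
    dominant_constraint: str     # which violation currently sets the max
    evaluation_fidelity: str     # always low-fidelity in the live loop
    evaluation_failed: bool
    failure_reason: Optional[str]
    budget_remaining: int
    reward_breakdown: dict       # per-term bonuses, penalties, shaping
    action_monitor: dict         # parameter values before/after, clamping

# Example observation (placeholder values, not measured data):
obs: P1Observation = {
    "p1_feasibility": 0.05,
    "constraints_satisfied": False,
    "dominant_constraint": "triangularity_violation",
    "evaluation_fidelity": "low_fidelity",
    "budget_remaining": 6,
}
```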

6. Episode Flow

  1. Reset from one frozen repaired-family seed or a small frozen seed set.
  2. Evaluate the initial state with low fidelity and return the first observation.
  3. On run, perturb one controllable parameter and re-evaluate with low fidelity.
  4. On restore_best, revert to the best known low-fidelity state, re-evaluate, and consume budget.
  5. On submit, re-evaluate the current state with low fidelity, consume budget, and end the episode.
  6. If the budget is exhausted before submit, end the episode there.

Failure semantics:

  • failed evaluations still consume budget
  • failed evaluations produce visible failure observations
  • failed evaluations apply a documented penalty
  • the environment must not silently convert failures into success paths
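The episode flow and failure semantics above can be sketched as one step function. Everything here is illustrative: `perturb`, `step`, and the step sizes are hypothetical stand-ins, and `evaluate_low_fidelity` is a placeholder for the official verifier call.

```python
# Hedged sketch of the live episode step (illustrative names and step sizes).
STEP_SIZES = {"small": 0.01, "medium": 0.05, "large": 0.1}

def perturb(params: dict, action: dict) -> dict:
    """Apply one discrete run action to the controllable parameters."""
    sign = 1.0 if action["direction"] == "increase" else -1.0
    new = dict(params)
    new[action["parameter"]] += sign * STEP_SIZES[action["magnitude"]]
    return new

def step(state: dict, action: dict, evaluate_low_fidelity):
    """One live step. Every evaluation, including failures, consumes budget."""
    state["budget_remaining"] -= 1
    if action["intent"] == "restore_best":
        state["current"] = state["best"]
    elif action["intent"] == "run":
        state["current"] = perturb(state["current"], action)
    try:
        state["metrics"] = evaluate_low_fidelity(state["current"])
        state["evaluation_failed"] = False
    except RuntimeError as exc:  # e.g. VMEC non-convergence
        # Failures stay visible; no silent conversion into a success path.
        state["evaluation_failed"] = True
        state["failure_reason"] = str(exc)
    done = action["intent"] == "submit" or state["budget_remaining"] <= 0
    return state, done
```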

7. Terminal Contract

At termination, the environment must provide:

  • final best design metrics
  • final feasibility status
  • total reward
  • a short human-readable trajectory summary
  • the final reward breakdown and action telemetry for the terminal step

Terminal reporting rules:

  • keep submit-time reporting on the same live low-fidelity truth surface as the rest of the episode
  • keep any higher-fidelity validation artifacts explicitly outside the live environment observation contract
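One way to read the terminal contract is as a single payload assembled from tracked state. The `terminal_report` helper and the field values below are hypothetical; only the field names come from this contract.

```python
# Illustrative terminal payload on the live low-fidelity truth surface.
def terminal_report(state: dict) -> dict:
    """Assemble the required terminal fields from tracked episode state."""
    return {
        "best_metrics": state["best_metrics"],
        "best_feasibility": state["best_low_fidelity_feasibility"],
        "total_reward": state["episode_total_reward"],
        "trajectory_summary": state["trajectory_summary"],
        "reward_breakdown": state["reward_breakdown"],
        "action_monitor": state["action_monitor"],
    }
```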

8. Verifier Contract

The verifier of record is constellaration.problems.GeometricalProblem.

The implementation must preserve:

  • objective direction
  • constraint direction
  • feasibility semantics
  • score ordering

The verifier should stay boundary-based:

  • build_boundary_from_params(...) -> SurfaceRZFourier
  • evaluate_boundary(boundary, fidelity) -> EvaluationMetrics

Do not treat parameterization-specific logic as verifier truth.
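A minimal sketch of that boundary-based seam, with the two contract signatures. `SurfaceRZFourier` and `EvaluationMetrics` are stubbed here so the interface shape is testable without VMEC; the real types live in the verifier stack, and the returned numbers are placeholders, not measurements.

```python
# Sketch of the boundary-based verifier seam (stubbed types, placeholder values).
from dataclasses import dataclass, field

@dataclass
class SurfaceRZFourier:  # stand-in for the real Fourier surface type
    nfp: int = 3
    coeffs: dict = field(default_factory=dict)

@dataclass
class EvaluationMetrics:  # stand-in for the verifier's metrics bundle
    max_elongation: float
    p1_feasibility: float
    feasible: bool

def build_boundary_from_params(params: dict) -> SurfaceRZFourier:
    """Parameterization-specific: maps the 4 live knobs to a boundary."""
    return SurfaceRZFourier(nfp=3, coeffs=dict(params))

def evaluate_boundary(boundary: SurfaceRZFourier, fidelity: str) -> EvaluationMetrics:
    """Verifier-owned: boundary in, metrics out (placeholder numbers here)."""
    assert fidelity in {"low_fidelity", "from_boundary_resolution"}
    return EvaluationMetrics(max_elongation=5.0, p1_feasibility=0.03, feasible=False)
```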

VMEC preset mapping:

  • run, restore_best, and submit use the low_fidelity VMEC preset (~0.6s, tolerant convergence)
  • higher-fidelity validation uses the from_boundary_resolution VMEC preset (~4s, adaptive convergence matching boundary Fourier resolution) outside the live environment loop
  • the high_fidelity VMEC preset (minimum 10 modes, strict convergence) is not used because it does not converge on the current mpol=3, ntor=3 boundaries
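The preset mapping above reduces to a small table. The dict below restates the rules in this contract; the approximate timings are the figures quoted here, and `VMEC_PRESET` itself is an illustrative name.

```python
# Illustrative intent -> VMEC preset mapping, restating the rules above.
VMEC_PRESET = {
    "run": "low_fidelity",             # ~0.6 s, tolerant convergence
    "restore_best": "low_fidelity",
    "submit": "low_fidelity",
    "offline_validation": "from_boundary_resolution",  # ~4 s, adaptive
}
# high_fidelity is deliberately absent: it requires at least 10 modes and
# does not converge on the current mpol=3, ntor=3 boundaries.
```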

Training and evaluation rule:

  • use the live low-fidelity environment contract, including explicit submit, as the RL surface
  • the standard repository notebook and training/llm_rollout.py workflows should stay aligned to that same action and reward contract
  • keep higher-fidelity validation in offline scripts, paired fixture checks, and final evidence artifacts
  • do not reintroduce a separate high-fidelity submit path into the live environment unless the contract is deliberately redefined

9. Reward V2

Reward V2 keeps the verifier-native structure from Reward V1 and adds a small amount of trajectory-aware shaping. Reward V1 fixed the main coarse-signal pathology from Reward V0: pure Δ official_feasibility was too coarse because official feasibility is a max over normalized constraint violations, so useful repair steps on non-dominant constraints could be nearly invisible to the reward.
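The coarse-signal pathology is easy to see in a worked example, since official feasibility is a max over normalized constraint violations. The numbers below are made up for illustration.

```python
# Worked example of the Reward V0 pathology: a genuine repair of a
# non-dominant constraint leaves delta-feasibility at exactly zero.
def official_feasibility(violations: dict) -> float:
    """Official P1 semantics: max over normalized constraint violations."""
    return max(violations.values())

before = {"aspect_ratio": 0.10, "triangularity": 0.40, "iota": 0.05}
after  = {"aspect_ratio": 0.02, "triangularity": 0.40, "iota": 0.05}

delta = official_feasibility(before) - official_feasibility(after)
# aspect_ratio improved 0.10 -> 0.02, yet delta == 0.0: the repair is
# invisible to a pure delta-feasibility reward.
```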

The remaining Reward V1 pathology was not verifier mismatch. It was short-horizon shaping:

  • the agent got no extra signal for setting a new best infeasible point
  • near-feasible progress below 0.02 had no milestone signal unless it crossed the full feasible boundary
  • feasible improvements only saw step-to-step objective deltas, not "new best feasible score" progress
  • repeated local loops or three-step stagnation had no explicit penalty beyond normal step cost

Target behavior:

  • infeasible to feasible crossing gets a clear positive bonus
  • feasible to infeasible regression gets a clear penalty
  • when both states are infeasible, reduced official feasibility violation should still help
  • on low-fidelity run steps, setting a new best infeasible feasibility should help
  • entering the near-feasible corridor around p1_feasibility <= 0.02 should get a small bounded bonus
  • when both states are infeasible, reduced normalized triangularity violation should help the most
  • when both states are infeasible, reduced normalized aspect-ratio and edge-iota violations should also help
  • when both states are feasible, lower max_elongation should help
  • on low-fidelity run steps, beating the previous best feasible score should help
  • larger run actions should pay a larger step cost than smaller run actions
  • restore_best should keep a flat non-submit step cost
  • repeated local revisits without improvement should pay a small penalty
  • three non-improving steps in a row should pay a small stagnation penalty
  • submit should be better than passive exhaustion when the design is genuinely improved
  • recovery after a failed evaluation may receive a modest bounded bonus
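A few of the target behaviors above can be sketched as one scalar reward function. This is a hedged sketch only: the coefficients are placeholders, not tuned repository values, and terms such as best-state bonuses, stagnation penalties, and the feasible-objective delta are omitted.

```python
# Hedged sketch of a few Reward V2 terms (illustrative coefficients only).
FEASIBLE_CROSS_BONUS = 1.0         # infeasible -> feasible crossing
INFEASIBLE_REGRESS_PENALTY = -1.0  # feasible -> infeasible regression
NEAR_FEASIBLE_BONUS = 0.1          # entering the p1_feasibility <= 0.02 corridor
STEP_COST = {"small": -0.01, "medium": -0.02, "large": -0.03}

def reward_v2(prev, curr, magnitude: str) -> float:
    """prev/curr are (feasible: bool, p1_feasibility: float) pairs."""
    r = STEP_COST[magnitude]  # larger run actions pay a larger step cost
    if not prev[0] and curr[0]:
        r += FEASIBLE_CROSS_BONUS
    elif prev[0] and not curr[0]:
        r += INFEASIBLE_REGRESS_PENALTY
    elif not prev[0] and not curr[0]:
        r += prev[1] - curr[1]  # reduced official violation still helps
        if curr[1] <= 0.02 < prev[1]:
            r += NEAR_FEASIBLE_BONUS
    return r
```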

Rules:

  • keep reward scalar and verifier-driven
  • keep the infeasible shaping tied to official normalized constraint violations, not family-name priors
  • do not add family-specific reward shaping from scadena, CreativeEngineer, Samet, or egodos
  • do not use reward complexity to compensate for blocked parameterization, poor seeds, or unclear observations

10. Reset and Fixture Policy

Reset policy:

  • start with exact frozen seeds
  • keep n_field_periods = 3
  • prefer a small reproducible seed set

Each seed should be:

  • reproducible
  • near enough to the feasible boundary to make the budget meaningful
  • not already solved

Fixture policy:

  • track good, boundary, and clearly bad references
  • use fixtures for verifier and reward sanity checks
  • do not turn fixture mining into a separate broad project

11. Open Measurements

These items remain open until measured on the repaired family:

  • exact repaired-family range bounds
  • exact triangularity_scale deltas
  • exact rotational_transform bounds
  • exact reset seed pool
  • whether the budget should stay at 6 or change

12. Out of Scope

  • porting the old ai-sci-feasible-designs harness
  • broad Fourier-mode action space as the main environment
  • complicated reward shaping before playtest evidence
  • a wider task family than the single stellarator environment