fusion-design-lab / docs /P1_ENV_CONTRACT_V1.md

CreativeEngineer

feat: reward verifier alignment, notebook hardening, model name fix

cdc237b about 1 month ago

	# P1 Environment Contract V1

	Role: Live technical contract and single source of truth (SSOT) for the current implementation phase
	Planning dependency: [`FUSION_DESIGN_LAB_PLAN_V2.md`](./FUSION_DESIGN_LAB_PLAN_V2.md)
	Evidence dependency: [`P1_PARAMETERIZATION_DEEPDIVE.md`](P1_PARAMETERIZATION_DEEPDIVE.md)

	## 1. Scope

	This document defines the live technical contract for:

	- [`server/physics.py`](../server/physics.py)
	- [`fusion_lab/models.py`](../fusion_lab/models.py)
	- [`server/environment.py`](../server/environment.py)
	- [`server/app.py`](../server/app.py)

	If the observation schema, action schema, episode flow, terminal conditions, or reward semantics change, update this file in the same task.

	## 2. Design Split

	Keep three layers separate:

	1. boundary builder
	2. official verifier
	3. environment

	Boundary builder owns:

	- the repaired low-dimensional family
	- rotating-ellipse seed generation
	- explicit triangularity control injection

	Official verifier owns:

	- boundary in, metrics out
	- official `P1` feasibility semantics
	- objective direction and score ordering
	- low-fidelity live evaluation mode
	- optional higher-fidelity offline validation mode
	- explicit failure results when VMEC or forward-model evaluation fails

	Environment owns:

	- reset pool
	- discrete actions
	- episode budget
	- best-state tracking
	- reward shaping

	## 3. Boundary Family

	The historical 3-knob upstream rotating-ellipse family is not the live contract.

	The live controllable knobs are:

	- `aspect_ratio`
	- `elongation`
	- `rotational_transform`
	- `triangularity_scale`

	Rules:

	- stay low-dimensional and human-playable
	- treat the current family as rotating-ellipse-derived, not plain upstream rotating ellipse
	- the coarse measured sweep is now recorded, but changes to the reset seeds or the episode budget should still wait for paired high-fidelity fixture checks

	## 4. Action Contract

	`intent` is one of:

	- `run`
	- `submit`
	- `restore_best`

	For `run`, the action also includes:

	- `parameter`: one of `aspect_ratio`, `elongation`, `rotational_transform`, `triangularity_scale`
	- `direction`: `increase` or `decrease`
	- `magnitude`: `small`, `medium`, or `large`

	Constraints:

	- keep the discrete interaction style
	- do not expose the full Fourier action space as the primary environment
	- do not use action complexity to compensate for missing clarity elsewhere
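
A minimal sketch of the action schema above, assuming a plain frozen dataclass. The class name `ActionSketch` is hypothetical and is not the model defined in `fusion_lab/models.py`. Rejecting run-only fields on `submit` and `restore_best` is also an assumption; the contract only says that `run` carries them.

```python
from dataclasses import dataclass
from typing import Optional

INTENTS = ("run", "submit", "restore_best")
PARAMETERS = (
    "aspect_ratio",
    "elongation",
    "rotational_transform",
    "triangularity_scale",
)
DIRECTIONS = ("increase", "decrease")
MAGNITUDES = ("small", "medium", "large")


@dataclass(frozen=True)
class ActionSketch:
    """Hypothetical stand-in for the action model; not the repository class."""

    intent: str
    parameter: Optional[str] = None
    direction: Optional[str] = None
    magnitude: Optional[str] = None

    def __post_init__(self) -> None:
        if self.intent not in INTENTS:
            raise ValueError(f"unknown intent: {self.intent!r}")
        if self.intent == "run":
            # A `run` action names one knob, one direction, and one step size.
            if self.parameter not in PARAMETERS:
                raise ValueError(f"unknown parameter: {self.parameter!r}")
            if self.direction not in DIRECTIONS:
                raise ValueError(f"unknown direction: {self.direction!r}")
            if self.magnitude not in MAGNITUDES:
                raise ValueError(f"unknown magnitude: {self.magnitude!r}")
        elif (self.parameter, self.direction, self.magnitude) != (None, None, None):
            # Assumption: non-run intents must not carry run-only fields.
            raise ValueError(f"{self.intent!r} takes no parameter, direction, or magnitude")
```

Validating at construction keeps malformed actions out of the environment step, so an invalid action never reaches the verifier or consumes budget.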

	## 5. Observation Contract

	The observation must stay metric-centered and human-readable.

	Required fields:

	- `max_elongation`
	- `aspect_ratio`
	- `average_triangularity`
	- `edge_iota_over_nfp`
	- `aspect_ratio_violation`
	- `triangularity_violation`
	- `iota_violation`
	- `dominant_constraint`
	- `p1_feasibility`
	- `p1_score`
	- `constraints_satisfied`
	- `vacuum_well`
	- `evaluation_fidelity`
	- `evaluation_failed`
	- `failure_reason`
	- `step_number`
	- `budget_remaining`
	- `no_progress_steps`
	- `best_low_fidelity_score`
	- `best_low_fidelity_feasibility`
	- `target_spec`
	- `diagnostics_text`
	- `reward_breakdown`
	- `action_monitor`
	- `episode_total_reward`
	- `trajectory_summary`

	Interpretation rules:

	- live environment metrics must be labeled as low-fidelity
	- best-state reporting should reflect the single live reward surface
	- the observation must be understandable without hidden state
	- normalized constraint-violation telemetry must follow the official `P1` constraint scales
	- the dominant active constraint must be visible so a human can explain repair-phase rewards
	- reward telemetry must expose which bonuses, penalties, and shaping terms contributed to the scalar reward
	- action telemetry must expose parameter values before and after the action, including clamped, no-op, and repeat-state moves
	- anti-stagnation state that can change reward must be visible in structured observation fields, not only free text
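
As an illustration of how `p1_feasibility` and `dominant_constraint` relate, the sketch below takes the max over already-normalized violations, which is how this contract describes official feasibility. The function name `summarize_violations`, the `"none"` label for the all-satisfied case, and the clipping of negative values to zero are assumptions; the official normalization scales live in the verifier and are not reproduced here.

```python
from typing import Dict, Tuple


def summarize_violations(violations: Dict[str, float]) -> Tuple[float, str]:
    """Return (feasibility, dominant_constraint) from normalized violations.

    Assumes each value is already normalized by the official P1 scales.
    """
    # Assumption: a negative entry means slack, so it counts as zero violation.
    clipped = {name: max(0.0, value) for name, value in violations.items()}
    dominant = max(clipped, key=clipped.get)
    if clipped[dominant] == 0.0:
        # Hypothetical label for the case where no constraint is active.
        return 0.0, "none"
    return clipped[dominant], dominant


feasibility, dominant = summarize_violations(
    {
        "aspect_ratio_violation": 0.0,
        "triangularity_violation": 0.5,
        "iota_violation": 0.25,
    }
)
# feasibility is 0.5 and dominant is "triangularity_violation"
```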

	## 6. Episode Flow

	1. Reset from one frozen repaired-family seed or a small frozen seed set.
	2. Evaluate the initial state with low fidelity and return the first observation.
	3. On `run`, perturb one controllable parameter and re-evaluate with low fidelity.
	4. On `restore_best`, revert to the best known low-fidelity state, re-evaluate, and consume budget.
	5. On `submit`, re-evaluate the current state with low fidelity, consume budget, and end the episode.
	6. End the episode on `submit` or budget exhaustion.

	Failure semantics:

	- failed evaluations still consume budget
	- failed evaluations produce visible failure observations
	- failed evaluations apply a documented penalty
	- the environment must not silently convert failures into success paths
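
The flow and failure semantics above can be sketched as a small state machine. This is a toy under stated assumptions, not the code in `server/environment.py`: `EpisodeSketch` and `toy_evaluate` are hypothetical names, the evaluator returns a single feasibility value or `None` on failure, the best state is ranked by feasibility alone, every step costs one budget unit, and the default budget of 6 is the value that section 11 still lists as open. The failure penalty is left out.

```python
from typing import Callable, Dict, Optional

Params = Dict[str, float]


class EpisodeSketch:
    """Toy episode loop; `evaluate` returns a feasibility or None on failure."""

    def __init__(
        self,
        evaluate: Callable[[Params], Optional[float]],
        seed: Params,
        budget: int = 6,
    ) -> None:
        self.evaluate = evaluate
        self.params = dict(seed)
        self.budget = budget
        self.done = False
        self.best_params = dict(seed)
        self.best_feasibility = float("inf")
        # The initial evaluation does not consume budget in this sketch.
        self._record(self.evaluate(self.params))

    def _record(self, feasibility: Optional[float]) -> None:
        # Lower violation is better; a failed evaluation never becomes "best".
        if feasibility is not None and feasibility < self.best_feasibility:
            self.best_feasibility = feasibility
            self.best_params = dict(self.params)

    def step(self, intent: str, parameter: str = "", delta: float = 0.0) -> dict:
        if self.done:
            raise RuntimeError("episode already ended")
        if intent == "run":
            self.params[parameter] += delta
        elif intent == "restore_best":
            self.params = dict(self.best_params)
        elif intent != "submit":
            raise ValueError(f"unknown intent: {intent!r}")
        self.budget -= 1  # every step consumes budget, including failed ones
        feasibility = self.evaluate(self.params)
        self._record(feasibility)
        self.done = intent == "submit" or self.budget <= 0
        return {
            "p1_feasibility": feasibility,
            "evaluation_failed": feasibility is None,
            "budget_remaining": self.budget,
            "done": self.done,
        }


def toy_evaluate(params: Params) -> Optional[float]:
    # Stand-in for the low-fidelity verifier; fails outside an arbitrary range.
    if params["elongation"] > 5.0:
        return None
    return abs(params["elongation"] - 1.0)
```

A failed `run` still moves the current state and spends budget, and the failure is reported in the observation rather than hidden, which matches the failure semantics listed above.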

	## 7. Terminal Contract

	At termination, the environment must provide:

	- final best design metrics
	- final feasibility status
	- total reward
	- a short human-readable trajectory summary
	- the final reward breakdown and action telemetry for the terminal step

	Terminal reporting rules:

	- keep submit-time reporting on the same live low-fidelity truth surface as the rest of the episode
	- keep any higher-fidelity validation artifacts explicitly outside the live environment observation contract

	## 8. Verifier Contract

	The verifier of record is `constellaration.problems.GeometricalProblem`.

	The implementation must preserve:

	- objective direction
	- constraint direction
	- feasibility semantics
	- score ordering

	The verifier should stay boundary-based:

	- `build_boundary_from_params(...) -> SurfaceRZFourier`
	- `evaluate_boundary(boundary, fidelity) -> EvaluationMetrics`

	Do not treat parameterization-specific logic as verifier truth.
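
One way to keep the builder and the verifier separate while still returning explicit failure results is a thin composition wrapper. The sketch below is an assumption about shape only: `EvaluationResultSketch` and `evaluate_params` are hypothetical names, the two callables are injected so the block does not depend on the real `build_boundary_from_params` or `evaluate_boundary`, and the fidelity label `"low_fidelity"` is borrowed from the preset name in this section.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass(frozen=True)
class EvaluationResultSketch:
    """Hypothetical result envelope; not the repository's EvaluationMetrics."""

    evaluation_fidelity: str
    evaluation_failed: bool
    failure_reason: str
    metrics: Optional[Any]


def evaluate_params(
    params: Dict[str, float],
    build_boundary: Callable[[Dict[str, float]], Any],
    evaluate_boundary: Callable[[Any, str], Any],
    fidelity: str = "low_fidelity",
) -> EvaluationResultSketch:
    """Build a boundary, evaluate it, and surface any failure explicitly."""
    try:
        boundary = build_boundary(params)  # parameterization-specific step
        metrics = evaluate_boundary(boundary, fidelity)  # boundary in, metrics out
    except Exception as exc:  # broad on purpose: a failure must stay visible
        reason = f"{type(exc).__name__}: {exc}"
        return EvaluationResultSketch(fidelity, True, reason, None)
    return EvaluationResultSketch(fidelity, False, "", metrics)
```

Because the verifier callable only ever sees a boundary, swapping the parameterization changes `build_boundary` and nothing else.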

	VMEC preset mapping:

	- `run`, `restore_best`, and `submit` use the `low_fidelity` VMEC preset (~0.6 s, tolerant convergence)
	- higher-fidelity validation uses the `from_boundary_resolution` VMEC preset (~4 s, adaptive convergence matching boundary Fourier resolution) outside the live environment loop
	- the `high_fidelity` VMEC preset (minimum 10 modes, strict convergence) is not used because it does not converge on the current `mpol=3, ntor=3` boundaries
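
The preset mapping can be pinned down as a small lookup table. This is a sketch: the preset names come from the list above, while `PRESET_BY_INTENT`, `OFFLINE_VALIDATION_PRESET`, and `preset_for` are hypothetical names.

```python
# All live intents share one preset, so submit stays on the same truth surface.
PRESET_BY_INTENT = {
    "run": "low_fidelity",
    "restore_best": "low_fidelity",
    "submit": "low_fidelity",
}

# Used only by offline validation, never inside the live environment loop.
OFFLINE_VALIDATION_PRESET = "from_boundary_resolution"


def preset_for(intent: str) -> str:
    """Return the VMEC preset name for a live intent."""
    try:
        return PRESET_BY_INTENT[intent]
    except KeyError:
        raise ValueError(f"unknown intent: {intent!r}") from None
```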

	Training and evaluation rule:

	- use the live low-fidelity environment contract, including explicit `submit`, as the RL surface
	- the standard repository notebook and `training/llm_rollout.py` workflows should stay aligned to that same action and reward contract
	- keep higher-fidelity validation in offline scripts, paired fixture checks, and final evidence artifacts
	- do not reintroduce a separate high-fidelity submit path into the live environment unless the contract is deliberately redefined

	## 9. Reward V2

	`Reward V2` keeps the verifier-native structure from `Reward V1` and adds a small amount of
	trajectory-aware shaping. `Reward V1` fixed the main coarse-signal pathology from `Reward V0`:
	pure `Δ official_feasibility` was too coarse because official feasibility is a max over
	normalized constraint violations, so useful repair steps on non-dominant constraints could be
	nearly invisible to the reward.
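
A short worked example of that pathology, using made-up violation values: repairing a non-dominant constraint leaves the max unchanged, so a reward based only on the change in official feasibility sees nothing.

```python
from typing import Dict


def official_feasibility(violations: Dict[str, float]) -> float:
    # Max over normalized violations, as described in this section.
    return max(max(0.0, value) for value in violations.values())


# Illustrative numbers only; triangularity is the dominant constraint.
before = {
    "aspect_ratio_violation": 0.125,
    "triangularity_violation": 0.5,
    "iota_violation": 0.375,
}
# A repair step that improves only the non-dominant iota constraint.
after = dict(before, iota_violation=0.125)

delta_official = official_feasibility(before) - official_feasibility(after)
per_constraint_gain = {name: before[name] - after[name] for name in before}

# delta_official is 0.0 although iota_violation dropped by 0.25
```

The per-constraint deltas carry the signal that the max hides, which is why the infeasible branch of the reward also looks at individual normalized violations.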

	The remaining `Reward V1` pathology was not verifier mismatch. It was short-horizon shaping:

	- the agent got no extra signal for setting a new best infeasible point
	- near-feasible progress below `0.02` had no milestone signal unless it crossed the full feasible boundary
	- feasible improvements only saw step-to-step objective deltas, not "new best feasible score" progress
	- repeated local loops or three-step stagnation had no explicit penalty beyond normal step cost
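
The last item can be made concrete with a small tracker for non-improving steps and revisited states. This is a sketch under assumptions: `StagnationTrackerSketch` is a hypothetical name, states are compared after rounding parameters to six decimals, and the stagnation flag stays raised for as long as the counter is at or above the three-step threshold.

```python
from typing import Dict, Set, Tuple


class StagnationTrackerSketch:
    """Tracks non-improving steps and revisits of an already-seen state."""

    def __init__(self, patience: int = 3) -> None:
        self.patience = patience
        self.no_progress_steps = 0
        self._seen: Set[Tuple[Tuple[str, float], ...]] = set()

    def update(self, params: Dict[str, float], improved: bool) -> Dict[str, bool]:
        # Round so that float noise does not hide a revisit (assumed tolerance).
        key = tuple(sorted((name, round(value, 6)) for name, value in params.items()))
        repeat_state = key in self._seen and not improved
        self._seen.add(key)
        self.no_progress_steps = 0 if improved else self.no_progress_steps + 1
        return {
            "repeat_state": repeat_state,
            "stagnated": self.no_progress_steps >= self.patience,
        }
```

Keeping the counter as a plain attribute makes it easy to surface in the structured observation, for example as `no_progress_steps`.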

	Target behavior:

	- infeasible to feasible crossing gets a clear positive bonus
	- feasible to infeasible regression gets a clear penalty
	- when both states are infeasible, reduced official feasibility violation should still help
	- on low-fidelity `run` steps, setting a new best infeasible feasibility should help
	- entering the near-feasible corridor (`p1_feasibility <= 0.02`) from outside it should get a small bounded bonus
	- when both states are infeasible, reduced normalized triangularity violation should help the most
	- when both states are infeasible, reduced normalized aspect-ratio and edge-iota violations should also help
	- when both states are feasible, lower `max_elongation` should help
	- on low-fidelity `run` steps, beating the previous best feasible score should help
	- larger `run` actions should pay a larger step cost than smaller `run` actions
	- `restore_best` should keep a flat non-submit step cost
	- repeated local revisits without improvement should pay a small penalty
	- three non-improving steps in a row should pay a small stagnation penalty
	- `submit` should be better than passive exhaustion when the design is genuinely improved
	- recovery after a failed evaluation may receive a modest bounded bonus
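
A partial sketch of how some of these targets could combine into a scalar with a visible breakdown. Every weight below is a hypothetical placeholder, not a repository value; only the `0.02` corridor threshold comes from this contract. `StateSketch` and `run_step_reward` are hypothetical names, and the best-state bonuses, revisit and stagnation penalties, `submit` handling, and failure recovery are left out for brevity.

```python
from dataclasses import dataclass
from typing import Dict

NEAR_FEASIBLE = 0.02  # corridor threshold named in this contract

# Hypothetical placeholder weights, chosen only to respect the orderings above.
CROSSING_BONUS = 3.0
NEAR_FEASIBLE_BONUS = 0.5
FEASIBILITY_WEIGHT = 1.0
ELONGATION_WEIGHT = 1.0
VIOLATION_WEIGHTS = {
    "triangularity_violation": 2.0,  # triangularity repair helps the most
    "aspect_ratio_violation": 1.0,
    "iota_violation": 1.0,
}
STEP_COST = {"small": 0.05, "medium": 0.1, "large": 0.2}


@dataclass(frozen=True)
class StateSketch:
    feasible: bool
    p1_feasibility: float
    violations: Dict[str, float]
    max_elongation: float


def run_step_reward(prev: StateSketch, curr: StateSketch, magnitude: str) -> Dict[str, float]:
    """Return a reward breakdown for one `run` step, including the total."""
    terms: Dict[str, float] = {"step_cost": -STEP_COST[magnitude]}
    if curr.feasible and not prev.feasible:
        terms["feasibility_crossing"] = CROSSING_BONUS
    elif prev.feasible and not curr.feasible:
        terms["feasibility_regression"] = -CROSSING_BONUS
    elif curr.feasible:
        # Both feasible: lower max elongation is better.
        terms["elongation_delta"] = ELONGATION_WEIGHT * (
            prev.max_elongation - curr.max_elongation
        )
    else:
        # Both infeasible: reward reduced official and per-constraint violation.
        terms["feasibility_delta"] = FEASIBILITY_WEIGHT * (
            prev.p1_feasibility - curr.p1_feasibility
        )
        for name, weight in VIOLATION_WEIGHTS.items():
            terms[name + "_delta"] = weight * (
                prev.violations[name] - curr.violations[name]
            )
        if prev.p1_feasibility > NEAR_FEASIBLE >= curr.p1_feasibility:
            terms["near_feasible_bonus"] = NEAR_FEASIBLE_BONUS
    terms["total"] = sum(terms.values())
    return terms
```

Returning the named terms alongside the total is one way to satisfy the observation rule that reward telemetry must show which bonuses, penalties, and shaping terms contributed.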

	Rules:

	- keep reward scalar and verifier-driven
	- keep the infeasible shaping tied to official normalized constraint violations, not family-name priors
	- do not add family-specific reward shaping from `scadena`, `CreativeEngineer`, `Samet`, or `egodos`
	- do not use reward complexity to compensate for blocked parameterization, poor seeds, or unclear observations

	## 10. Reset and Fixture Policy

	Reset policy:

	- start with exact frozen seeds
	- keep `n_field_periods = 3`
	- prefer a small reproducible seed set

	Each seed should be:

	- reproducible
	- near enough to the feasible boundary to make the budget meaningful
	- not already solved
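
A minimal sketch of a frozen, reproducible reset pool. The numeric values are placeholders only, since the exact seed pool is still an open measurement in section 11; `SEED_POOL` and `reset_seed` are hypothetical names, and round-robin selection by episode index is an assumed policy.

```python
from typing import Dict, Tuple

N_FIELD_PERIODS = 3  # fixed by this contract

# Placeholder numbers only: the real seed pool has not been frozen yet.
SEED_POOL: Tuple[Dict[str, float], ...] = (
    {"aspect_ratio": 3.5, "elongation": 1.5, "rotational_transform": 1.5, "triangularity_scale": 0.5},
    {"aspect_ratio": 3.75, "elongation": 1.25, "rotational_transform": 1.5, "triangularity_scale": 0.5},
)


def reset_seed(episode_index: int) -> Dict[str, float]:
    """Pick a seed deterministically so every reset is reproducible."""
    if episode_index < 0:
        raise ValueError("episode_index must be non-negative")
    # Return a copy so an episode cannot mutate the frozen pool.
    return dict(SEED_POOL[episode_index % len(SEED_POOL)])
```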

	Fixture policy:

	- track good, boundary, and clearly bad references
	- use fixtures for verifier and reward sanity checks
	- do not turn fixture mining into a separate broad project

	## 11. Open Measurements

	These items remain open until measured on the repaired family:

	- exact repaired-family range bounds
	- exact `triangularity_scale` deltas
	- exact `rotational_transform` bounds
	- exact reset seed pool
	- whether the budget should stay at 6 or change

	## 12. Out of Scope

	- porting the old `ai-sci-feasible-designs` harness
	- broad Fourier-mode action space as the main environment
	- complicated reward shaping before playtest evidence
	- a wider task family than the single stellarator environment