Commit 8bf0155
Parent(s): 567ff67

feat: add replay playtest and tighten fail-fast validation

Files changed (7):
- README.md +8 -8
- TODO.md +5 -3
- baselines/measured_sweep.py +45 -12
- baselines/replay_playtest.py +207 -0
- docs/FUSION_DESIGN_LAB_PLAN_V2.md +6 -2
- docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md +6 -5
- training/notebooks/northflank_smoke.py +13 -18
README.md CHANGED

@@ -30,7 +30,7 @@ Implementation status:
 - the current environment uses `constellaration` for low-fidelity `run` steps and high-fidelity `submit` evaluation
 - the repaired 4-knob low-dimensional family is now wired into the runtime path
 - the first measured sweep note, tracked low-fidelity fixtures, and an initial low-fidelity manual playtest note now exist
-- the next runtime work is a tiny low-fi PPO smoke run
+- the next runtime work is a tiny low-fi PPO smoke run as a diagnostic-only check, followed immediately by paired high-fidelity fixture checks and one real submit-side manual trace

 ## Execution Status

@@ -52,7 +52,7 @@ Implementation status:
 - [x] Label low-fi `run` truth vs high-fi `submit` truth in observations and task docs
 - [x] Separate high-fidelity submit scoring/reporting from low-fidelity rollout score state
 - [x] Add tracked `P1` fixtures under `server/data/p1/`
-- [ ] Run a tiny low-fi PPO smoke run, then
+- [ ] Run a tiny low-fi PPO smoke run as a diagnostic-only check, then complete paired high-fidelity fixture checks and at least one real submit-side manual trace before any broader training push
 - [ ] Refresh the heuristic baseline for the real verifier path
 - [ ] Deploy the real environment to HF Space

@@ -67,12 +67,12 @@ Implementation status:
 - Observation best-state reporting is now split explicitly between low-fidelity rollout state and high-fidelity submit state; baseline traces and demo copy should use those explicit fields rather than infer a mixed best-state story.
 - Budget exhaustion now returns a smaller terminal reward than explicit `submit`; keep that asymmetry when tuning reward so agents still prefer deliberate submission.
 - The real-verifier baseline rerun showed the old heuristic is no longer useful as-is: over 5 seeded episodes, both agents stayed at `0.0` mean best score and the heuristic underperformed random on reward. The heuristic needs redesign after the repaired parameterization and manual playtesting.
-- The first low-fidelity manual playtest note is in [docs/P1_MANUAL_PLAYTEST_LOG.md](docs/P1_MANUAL_PLAYTEST_LOG.md). The next fail-fast step is a tiny low-fi PPO smoke run
+- The first low-fidelity manual playtest note is in [docs/P1_MANUAL_PLAYTEST_LOG.md](docs/P1_MANUAL_PLAYTEST_LOG.md). The next fail-fast step is a tiny low-fi PPO smoke run used only to surface obvious learnability bugs, followed immediately by high-fidelity fixture pairing and one real `submit` trace.

 Current mode:

 - strategic task choice is already locked
-- the next work is a tiny low-fi PPO smoke run
+- the next work is a tiny low-fi PPO smoke run as a smoke test only, then paired high-fidelity fixture checks, one submit-side manual trace, heuristic refresh, smoke validation, and deployment
 - new planning text should only appear when a real blocker forces a decision change

 ## Planned Repository Layout

@@ -117,7 +117,7 @@ uv sync --extra notebooks
 - Recommended compute workspace: Northflank Jupyter Notebook with PyTorch on the team H100
 - OpenEnv deployment target: Hugging Face Spaces
 - Minimal submission notebook target: Colab
-- Required notebook artifact: one public Colab notebook
+- Required notebook artifact: one public Colab notebook that demonstrates trained-policy behavior against the environment
 - Verifier of record: `constellaration.problems.GeometricalProblem`
 - Environment style: fresh wiring in this repo, not a port of the old `ai-sci-feasible-designs` harness
 - Northflank containers are ephemeral, so persistent storage should be attached before relying on saved models, caches, or fixture data

@@ -126,10 +126,10 @@ uv sync --extra notebooks

 ## Immediate Next Steps

-- [ ]
-- [ ]
+- [ ] Run a tiny low-fidelity PPO smoke run and stop after a few readable trajectories or one clear failure mode.
+- [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit spot checks immediately after the PPO smoke run.
 - [ ] Decide whether any reset seed should move based on the measured sweep plus those paired checks.
-- [ ] Run at least one submit-side manual trace
+- [ ] Run at least one submit-side manual trace before any broader training push, then record the first real reward pathology, if any.
 - [ ] Refresh the heuristic baseline using measured sweep and playtest evidence, then save one comparison trace.
 - [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
 - [ ] Deploy the environment to HF Space.
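The README note above says budget exhaustion must pay a smaller terminal reward than an explicit `submit`. That asymmetry can be sketched as a simple property check; everything below is illustrative (`SUBMIT_BONUS`, `EXHAUSTION_BONUS`, and `terminal_reward` are stand-ins, not the environment's actual reward code):

```python
# Hypothetical constants; the real terminal rewards live in the
# environment code and are not reproduced here.
SUBMIT_BONUS = 1.0        # terminal bonus for a deliberate submit
EXHAUSTION_BONUS = 0.25   # smaller bonus when the step budget runs out


def terminal_reward(best_score: float, *, submitted: bool) -> float:
    # Explicit submit earns the larger bonus; budget exhaustion the
    # smaller one, so deliberate submission always dominates.
    return best_score + (SUBMIT_BONUS if submitted else EXHAUSTION_BONUS)


# For any final score, the agent should prefer submitting on purpose.
for score in (0.0, 0.3, 0.9):
    assert terminal_reward(score, submitted=True) > terminal_reward(score, submitted=False)
```

Any tuning that flips this inequality would teach agents to idle until the budget expires instead of committing to a design.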
TODO.md CHANGED

@@ -54,9 +54,9 @@ flowchart TD
 B["P1 Contract Lock"] --> D["P1 Models + Environment"]
 C["constellaration Physics Wiring"] --> D
 D --> P["Parameterization Repair"]
-P -->
-
-
+P --> F["Tiny PPO Smoke"]
+F --> E["Fixture Checks"]
+E --> G["Submit-side Manual Playtest"]
 G --> H["Reward V1"]
 H --> I["Baselines"]
 I --> J["HF Space Deploy"]

@@ -196,6 +196,8 @@ flowchart TD
 fail quickly on learnability, reward exploits, and action-space problems before investing in longer training
 Note:
 treat this as a smoke test, not as proof that the terminal `submit` contract is already validated
+stop after a few readable trajectories or one clear failure mode
+paired high-fidelity fixture checks must happen immediately after this smoke pass

 - [ ] Manual-playtest 5-10 episodes
 Goal:
baselines/measured_sweep.py CHANGED

@@ -4,7 +4,11 @@ Validates ranges, crash zones, feasibility regions, and identifies
 candidate reset seeds for the repaired low-dimensional family.

 Usage:
-
+    # Broad evenly spaced grid (default 3 points per parameter)
+    uv run python baselines/measured_sweep.py --grid-points 5
+
+    # Targeted sweep around the known feasible zone
+    uv run python baselines/measured_sweep.py --targeted
 """

 from __future__ import annotations

@@ -29,15 +33,29 @@ def linspace_inclusive(low: float, high: float, n: int) -> list[float]:
     return [round(float(v), 4) for v in np.linspace(low, high, n)]


+TARGETED_VALUES: dict[str, list[float]] = {
+    "aspect_ratio": [3.4, 3.6, 3.8],
+    "elongation": [1.2, 1.4, 1.6],
+    "rotational_transform": [1.50, 1.55, 1.60, 1.65, 1.70, 1.75, 1.80],
+    "triangularity_scale": [0.55, 0.58, 0.60, 0.62, 0.65],
+}
+
+
 def parse_args() -> argparse.Namespace:
     parser = argparse.ArgumentParser(
         description="Run a measured low-fidelity sweep over the repaired 4-knob family."
     )
-    parser.
+    mode = parser.add_mutually_exclusive_group()
+    mode.add_argument(
         "--grid-points",
         type=int,
         default=3,
-        help="Number of evenly spaced points per parameter range.",
+        help="Number of evenly spaced points per parameter range (default: 3).",
+    )
+    mode.add_argument(
+        "--targeted",
+        action="store_true",
+        help="Use the pre-defined targeted value set around the known feasible zone.",
     )
     parser.add_argument(
         "--output-dir",

@@ -48,12 +66,15 @@ def parse_args() -> argparse.Namespace:
     return parser.parse_args()


-def run_sweep(*, grid_points: int) -> list[dict]:
-    if
-
-
-
+def run_sweep(*, grid_points: int, targeted: bool = False) -> tuple[list[dict], float]:
+    if targeted:
+        grids = TARGETED_VALUES
+    else:
+        if grid_points < 2:
+            raise ValueError("--grid-points must be at least 2.")
+        grids = {
+            name: linspace_inclusive(lo, hi, grid_points) for name, (lo, hi) in SWEEP_RANGES.items()
+        }

     configs = list(
         product(

@@ -108,7 +129,8 @@ def run_sweep(*, grid_points: int) -> list[dict]:
         f"{rate:.1f} eval/s"
     )

-
+    total_elapsed = time.monotonic() - t0
+    return results, total_elapsed


 def analyze(results: list[dict]) -> dict:

@@ -196,17 +218,28 @@ def analyze(results: list[dict]) -> dict:

 def main() -> None:
     args = parse_args()
-    results = run_sweep(
+    results, elapsed_s = run_sweep(
+        grid_points=args.grid_points,
+        targeted=args.targeted,
+    )

     out_dir = args.output_dir
     out_dir.mkdir(exist_ok=True)
+    mode_label = "targeted" if args.targeted else f"grid{args.grid_points}"
     timestamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
     out_path = out_dir / f"measured_sweep_{timestamp}.json"

     analysis = analyze(results)

+    metadata = {
+        "mode": mode_label,
+        "timestamp": timestamp,
+        "elapsed_seconds": round(elapsed_s, 1),
+        "seconds_per_eval": round(elapsed_s / max(len(results), 1), 2),
+    }
+
     with open(out_path, "w") as f:
-        json.dump({"analysis": analysis, "results": results}, f, indent=2)
+        json.dump({"metadata": metadata, "analysis": analysis, "results": results}, f, indent=2)
     print(f"\nResults saved to {out_path}")
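The sweep's new CLI split between an evenly spaced grid and a fixed targeted value set is a standard argparse pattern: a mutually exclusive group, then `itertools.product` to expand the chosen grids into configs. A minimal self-contained sketch, with small illustrative ranges standing in for the repo's `SWEEP_RANGES` and `TARGETED_VALUES`:

```python
import argparse
from itertools import product

# Illustrative stand-ins for the repo's SWEEP_RANGES / TARGETED_VALUES.
RANGES: dict[str, tuple[float, float]] = {
    "aspect_ratio": (3.0, 4.0),
    "elongation": (1.0, 2.0),
}
TARGETED: dict[str, list[float]] = {
    "aspect_ratio": [3.4, 3.6],
    "elongation": [1.2, 1.4, 1.6],
}


def linspace_inclusive(lo: float, hi: float, n: int) -> list[float]:
    # Evenly spaced points including both endpoints (like np.linspace).
    step = (hi - lo) / (n - 1)
    return [round(lo + i * step, 4) for i in range(n)]


def build_grids(grid_points: int, targeted: bool) -> dict[str, list[float]]:
    if targeted:
        return TARGETED
    if grid_points < 2:
        raise ValueError("--grid-points must be at least 2.")
    return {name: linspace_inclusive(lo, hi, grid_points) for name, (lo, hi) in RANGES.items()}


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Sweep mode sketch.")
    # Passing both flags at once is rejected as a CLI error.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--grid-points", type=int, default=3)
    mode.add_argument("--targeted", action="store_true")
    return parser.parse_args(argv)


args = parse_args(["--grid-points", "3"])
grids = build_grids(args.grid_points, args.targeted)
# Cross product over per-parameter value lists, one dict per config.
configs = [dict(zip(grids, combo)) for combo in product(*grids.values())]
print(len(configs))  # 3 points per parameter over 2 parameters -> 9 configs
```

The targeted dict skips the validity check entirely because its value lists are curated by hand, mirroring the `if targeted:` early branch in the diff above.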
baselines/replay_playtest.py ADDED

@@ -0,0 +1,207 @@
+"""Fixed-action replay playtest for reward branch coverage.
+
+Runs 5 scripted episodes against StellaratorEnvironment directly.
+Each episode targets specific untested reward branches.
+
+Episodes:
+1. Seed 0 — repair + feasible-side objective shaping + budget exhaustion
+2. Seed 1 — repair from different seed (ar=3.4, rt=1.6)
+3. Seed 2 — boundary clamping (ar=3.8 = upper bound)
+4. Seed 0 — push rt into crash zone + restore_best
+5. Seed 0 — repair + objective move + explicit submit
+"""
+
+from __future__ import annotations
+
+import json
+import sys
+from dataclasses import asdict, dataclass
+from typing import Sequence
+
+from fusion_lab.models import StellaratorAction, StellaratorObservation
+from server.environment import StellaratorEnvironment
+
+
+@dataclass(frozen=True)
+class StepRecord:
+    step: int
+    intent: str
+    action_label: str
+    score: float
+    feasibility: float
+    constraints_satisfied: bool
+    evaluation_fidelity: str
+    evaluation_failed: bool
+    max_elongation: float
+    reward: float
+    budget_remaining: int
+    done: bool
+
+
+def _action_label(action: StellaratorAction) -> str:
+    if action.intent != "run":
+        return action.intent
+    return f"{action.parameter} {action.direction} {action.magnitude}"
+
+
+def _record(obs: StellaratorObservation, step: int, action: StellaratorAction) -> StepRecord:
+    return StepRecord(
+        step=step,
+        intent=action.intent,
+        action_label=_action_label(action),
+        score=obs.p1_score,
+        feasibility=obs.p1_feasibility,
+        constraints_satisfied=obs.constraints_satisfied,
+        evaluation_fidelity=obs.evaluation_fidelity,
+        evaluation_failed=obs.evaluation_failed,
+        max_elongation=obs.max_elongation,
+        reward=obs.reward or 0.0,
+        budget_remaining=obs.budget_remaining,
+        done=obs.done,
+    )
+
+
+def _run_episode(
+    env: StellaratorEnvironment,
+    seed: int,
+    actions: Sequence[StellaratorAction],
+    label: str,
+) -> list[StepRecord]:
+    obs = env.reset(seed=seed)
+    print(f"\n{'=' * 72}")
+    print(f"Episode: {label}")
+    print(f"Seed: {seed}")
+    print(
+        f"  reset score={obs.p1_score:.6f} feasibility={obs.p1_feasibility:.6f} "
+        f"constraints={'yes' if obs.constraints_satisfied else 'no'} "
+        f"elongation={obs.max_elongation:.4f} budget={obs.budget_remaining}"
+    )
+
+    records: list[StepRecord] = []
+    for i, action in enumerate(actions, start=1):
+        if obs.done:
+            print(f"  (episode ended before step {i})")
+            break
+        obs = env.step(action)
+        rec = _record(obs, i, action)
+        records.append(rec)
+
+        status = (
+            "FAIL" if rec.evaluation_failed else ("OK" if rec.constraints_satisfied else "viol")
+        )
+        print(
+            f"  step {i:2d} {rec.action_label:<42s} "
+            f"reward={rec.reward:+8.4f} score={rec.score:.6f} "
+            f"feas={rec.feasibility:.6f} elong={rec.max_elongation:.4f} "
+            f"status={status} budget={rec.budget_remaining} "
+            f"{'DONE' if rec.done else ''}"
+        )
+
+    total_reward = sum(r.reward for r in records)
+    print(f"  total_reward={total_reward:+.4f}")
+    return records
+
+
+def _run(action: str, param: str, direction: str, magnitude: str) -> StellaratorAction:
+    return StellaratorAction(
+        intent="run",
+        parameter=param,
+        direction=direction,
+        magnitude=magnitude,
+    )
+
+
+def _submit() -> StellaratorAction:
+    return StellaratorAction(intent="submit")
+
+
+def _restore() -> StellaratorAction:
+    return StellaratorAction(intent="restore_best")
+
+
+# ── Episode definitions ──────────────────────────────────────────────────
+
+EPISODE_1 = (
+    "seed0_repair_objective_exhaustion",
+    0,
+    [
+        _run("run", "triangularity_scale", "increase", "medium"),  # cross feasibility
+        _run("run", "elongation", "decrease", "small"),  # feasible-side shaping
+        _run("run", "elongation", "decrease", "small"),  # more shaping
+        _run("run", "elongation", "decrease", "small"),  # more shaping
+        _run("run", "elongation", "decrease", "small"),  # more shaping
+        _run("run", "elongation", "decrease", "small"),  # budget=0 → done bonus
+    ],
+)
+
+EPISODE_2 = (
+    "seed1_repair_different_seed",
+    1,
+    [
+        _run(
+            "run", "triangularity_scale", "increase", "medium"
+        ),  # cross feasibility from ar=3.4,rt=1.6
+        _run("run", "elongation", "decrease", "small"),  # feasible-side shaping
+        _run("run", "elongation", "decrease", "small"),  # more shaping
+        _run("run", "triangularity_scale", "increase", "small"),  # push tri further
+        _run("run", "elongation", "decrease", "small"),  # more shaping
+        _run("run", "elongation", "decrease", "small"),  # budget exhaustion
+    ],
+)
+
+EPISODE_3 = (
+    "seed2_boundary_clamping",
+    2,
+    [
+        _run("run", "aspect_ratio", "increase", "large"),  # ar=3.8 + 0.2 → clamped at 3.8
+        _run("run", "triangularity_scale", "increase", "medium"),  # repair toward feasibility
+        _run("run", "triangularity_scale", "increase", "medium"),  # push further
+        _run("run", "elongation", "decrease", "small"),  # shaping if feasible
+        _run("run", "aspect_ratio", "decrease", "large"),  # move ar down
+        _run("run", "elongation", "decrease", "small"),  # budget exhaustion
+    ],
+)
+
+EPISODE_4 = (
+    "seed0_crash_recovery_restore",
+    0,
+    [
+        _run("run", "triangularity_scale", "increase", "medium"),  # cross feasibility first
+        _run("run", "rotational_transform", "increase", "large"),  # rt 1.5→1.7
+        _run("run", "rotational_transform", "increase", "large"),  # rt 1.7→1.9 (crash zone)
+        _restore(),  # recover best state
+        _run("run", "elongation", "decrease", "small"),  # continue from best
+        _run("run", "elongation", "decrease", "small"),  # budget exhaustion
+    ],
+)
+
+EPISODE_5 = (
+    "seed0_repair_objective_submit",
+    0,
+    [
+        _run("run", "triangularity_scale", "increase", "medium"),  # cross feasibility
+        _run("run", "elongation", "decrease", "small"),  # feasible-side objective
+        _submit(),  # explicit high-fidelity submit
+    ],
+)
+
+ALL_EPISODES = [EPISODE_1, EPISODE_2, EPISODE_3, EPISODE_4, EPISODE_5]
+
+
+def main(output_json: str | None = None) -> None:
+    env = StellaratorEnvironment()
+    all_results: dict[str, list[dict[str, object]]] = {}
+
+    for label, seed, actions in ALL_EPISODES:
+        records = _run_episode(env, seed, actions, label)
+        all_results[label] = [asdict(r) for r in records]
+
+    if output_json:
+        with open(output_json, "w") as f:
+            json.dump(all_results, f, indent=2)
+        print(f"\nResults written to {output_json}")
+
+
+if __name__ == "__main__":
+    out = sys.argv[1] if len(sys.argv) > 1 else None
+    main(output_json=out)
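The replay script's core pattern (reset, step through a fixed action list, stop early when the episode ends, serialize records with `asdict`) can be exercised against a stub environment without the real `StellaratorEnvironment` or verifier. All names below are illustrative stand-ins:

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StubObs:
    score: float
    budget_remaining: int
    done: bool


class StubEnv:
    """Deterministic stand-in: each step costs one unit of a 3-step budget."""

    def reset(self, seed: int) -> StubObs:
        self._budget = 3
        self._score = float(seed)
        return StubObs(self._score, self._budget, done=False)

    def step(self, action: str) -> StubObs:
        self._budget -= 1
        self._score += 0.1
        return StubObs(round(self._score, 2), self._budget, done=self._budget == 0)


def replay(env: StubEnv, seed: int, actions: list[str]) -> list[dict]:
    obs = env.reset(seed=seed)
    records = []
    for i, action in enumerate(actions, start=1):
        if obs.done:
            break  # scripted episode ended early, like the real playtest loop
        obs = env.step(action)
        records.append({"step": i, "action": action, **asdict(obs)})
    return records


# Four scripted actions against a three-step budget: the last is skipped.
records = replay(StubEnv(), seed=0, actions=["run", "run", "run", "run"])
print(json.dumps(records, indent=2))
```

The point of the fixed action script is determinism: the same action list against the same seed should hit the same reward branches on every run, which is what makes the episode log comparable across commits.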
docs/FUSION_DESIGN_LAB_PLAN_V2.md CHANGED

@@ -80,6 +80,8 @@ Practical fail-fast rule:

 - allow a tiny low-fidelity PPO smoke run before full submit-side validation
 - use it only to surface obvious learnability bugs, reward exploits, or action-space problems
+- stop after a few readable trajectories or one clear failure mode
+- run paired high-fidelity fixture checks and one real submit-side trace immediately after the smoke run
 - do not use low-fidelity training alone as proof that the terminal `submit` contract is trustworthy

 ## 5. Document Roles

@@ -133,8 +135,8 @@ The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V1.md).

 ## 8. Execution Order

-- [ ] Run a tiny low-fidelity PPO smoke pass and
-- [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit checks.
+- [ ] Run a tiny low-fidelity PPO smoke pass and stop after a few trajectories once it reveals either readable behavior or one clear failure mode.
+- [ ] Pair the tracked low-fidelity fixtures with high-fidelity submit checks immediately after the PPO smoke pass.
 - [ ] Decide whether the reset pool should change based on the measured sweep plus those paired checks.
 - [ ] Run at least one submit-side manual trace, then expand to 5 to 10 episodes and record the first real confusion point, exploit, or reward pathology.
 - [ ] Adjust reward or penalties only if playtesting exposes a concrete problem.

@@ -155,10 +157,12 @@ Gate 2: tiny PPO smoke is sane

 - a small low-fidelity policy can improve or at least reveal a concrete failure mode quickly
 - trajectories are readable enough to debug
+- the smoke run stops at that diagnostic threshold instead of turning into a broader training phase

 Gate 3: fixture checks pass

 - good, boundary, and bad references behave as expected
+- the paired high-fidelity checks happen immediately after the PPO smoke run, not as optional later work

 Gate 4: manual playtest passes
docs/archive/FUSION_NEXT_12_HOURS_CHECKLIST.md CHANGED

@@ -14,8 +14,9 @@ Use these docs instead:
 Current execution priority remains:

 1. measured sweep
-2.
-3.
-4.
-5.
-6.
+2. tiny PPO smoke pass as a diagnostic-only check
+3. tracked fixtures with paired high-fidelity submit checks
+4. one submit-side manual trace, then broader manual playtest
+5. heuristic baseline refresh
+6. HF Space proof
+7. notebook, demo, and repo polish
training/notebooks/northflank_smoke.py CHANGED

@@ -7,10 +7,8 @@ from datetime import UTC, datetime
 from importlib.metadata import version
 from pathlib import Path

-from
-
-from server.environment import BASELINE_PARAMS, N_FIELD_PERIODS
-from server.physics import EvaluationMetrics, evaluate_params
+from server.contract import N_FIELD_PERIODS, SMOKE_TEST_PARAMS
+from server.physics import EvaluationMetrics, build_boundary_from_params, evaluate_boundary


 DEFAULT_OUTPUT_DIR = Path("training/notebooks/artifacts")

@@ -23,15 +21,15 @@ class SmokeArtifact:
     boundary_type: str
     n_field_periods: int
     params: dict[str, float]
-    metrics: dict[str, float | bool]
+    metrics: dict[str, str | float | bool]


 def parse_args() -> argparse.Namespace:
     parser = argparse.ArgumentParser(
         description=(
             "Run the Fusion Design Lab Northflank smoke check: generate one "
-            "rotating-ellipse boundary, run one
-            "and write a JSON artifact."
+            "rotating-ellipse-derived low-dimensional boundary, run one "
+            "low-fidelity verifier call, and write a JSON artifact."
         )
     )
     parser.add_argument(

@@ -47,23 +45,17 @@ def parse_args() -> argparse.Namespace:


 def build_artifact() -> SmokeArtifact:
-    boundary =
-
-        elongation=BASELINE_PARAMS.elongation,
-        rotational_transform=BASELINE_PARAMS.rotational_transform,
-        n_field_periods=N_FIELD_PERIODS,
-    )
-    metrics = evaluate_params(
-        BASELINE_PARAMS,
+    boundary = build_boundary_from_params(
+        SMOKE_TEST_PARAMS,
         n_field_periods=N_FIELD_PERIODS,
-        fidelity="low",
     )
+    metrics = evaluate_boundary(boundary, fidelity="low")
     return SmokeArtifact(
         created_at_utc=datetime.now(UTC).isoformat(),
         constellaration_version=version("constellaration"),
        boundary_type=type(boundary).__name__,
         n_field_periods=N_FIELD_PERIODS,
-        params=
+        params=SMOKE_TEST_PARAMS.model_dump(),
         metrics=_metrics_payload(metrics),
     )

@@ -76,8 +68,11 @@ def write_artifact(output_dir: Path, artifact: SmokeArtifact) -> Path:
     return output_path


-def _metrics_payload(metrics: EvaluationMetrics) -> dict[str, float | bool]:
+def _metrics_payload(metrics: EvaluationMetrics) -> dict[str, str | float | bool]:
     return {
+        "evaluation_fidelity": metrics.evaluation_fidelity,
+        "evaluation_failed": metrics.evaluation_failed,
+        "failure_reason": metrics.failure_reason,
         "max_elongation": metrics.max_elongation,
         "aspect_ratio": metrics.aspect_ratio,
         "average_triangularity": metrics.average_triangularity,
|