CreativeEngineer committed on
Commit 2348d3e · 1 Parent(s): 9827b11

docs: clarify notebook surfaces and OpenEnv guidance

README.md CHANGED
@@ -3,7 +3,7 @@
3
  Fusion Design Lab is an environment-first [OpenEnv](https://openenv.dev) hackathon project for the `P1` stellarator benchmark.
4
 
5
  **Live Environment**: [HF Space](https://huggingface.co/spaces/CreativeEngineer/fusion-design-lab)
6
- **Training Notebook**: [Colab (GRPO + Unsloth)](training/notebooks/fusion_design_lab_training.ipynb)
7
 
8
  ## What It Does
9
 
@@ -57,7 +57,7 @@ The environment uses [`constellaration`](https://pypi.org/project/constellaratio
57
  - [x] Complete paired high-fidelity fixture checks and at least one real submit-side manual trace before any broader training push
58
  - [x] Refresh the heuristic baseline for the real verifier path
59
  - [x] Deploy the real environment to HF Space
60
- - [x] Add the Colab training notebook under `training/notebooks`
61
 
62
  ## Known Gaps
63
 
@@ -121,13 +121,13 @@ uv sync --extra notebooks
121
 
122
  - Recommended compute workspace: Northflank Jupyter Notebook with PyTorch on the team H100
123
  - OpenEnv deployment target: Hugging Face Spaces
124
- - Minimal submission notebook target: Colab
125
- - Required notebook artifact: one public Colab notebook that demonstrates trained-policy behavior against the environment
126
  - Verifier of record: `constellaration.problems.GeometricalProblem`
127
  - Environment style: fresh wiring in this repo, not a port of the old `ai-sci-feasible-designs` harness
128
  - Northflank containers are ephemeral, so persistent storage should be attached before relying on saved models, caches, or fixture data
129
  - Preferred deployment path: push this GitHub repo and let HF Space build from the repo/Docker configuration rather than copying code manually
130
- - Preferred Colab/HF Space connectivity: make the HF Space public for the hackathon unless privacy becomes necessary; if private, document and use an explicit access token in the notebook
131
 
132
  ## Immediate Next Steps
133
 
@@ -139,7 +139,7 @@ uv sync --extra notebooks
139
  - [ ] Save one presentation-ready comparison trace from the refreshed heuristic baseline.
140
  - [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
141
  - [x] Deploy the environment to HF Space.
142
- - [x] Add the Colab notebook under `training/notebooks`.
143
 
144
  These are implementation steps, not another planning phase.
145
 
 
3
  Fusion Design Lab is an environment-first [OpenEnv](https://openenv.dev) hackathon project for the `P1` stellarator benchmark.
4
 
5
  **Live Environment**: [HF Space](https://huggingface.co/spaces/CreativeEngineer/fusion-design-lab)
6
+ **Training Notebook**: [Repository Notebook (GRPO + Unsloth)](training/notebooks/fusion_design_lab_training.ipynb)
7
 
8
  ## What It Does
9
 
 
57
  - [x] Complete paired high-fidelity fixture checks and at least one real submit-side manual trace before any broader training push
58
  - [x] Refresh the heuristic baseline for the real verifier path
59
  - [x] Deploy the real environment to HF Space
60
+ - [x] Add the public training notebook under `training/notebooks`
61
 
62
  ## Known Gaps
63
 
 
121
 
122
  - Recommended compute workspace: Northflank Jupyter Notebook with PyTorch on the team H100
123
  - OpenEnv deployment target: Hugging Face Spaces
124
+ - Submission notebook surface: one public notebook artifact; mirror it to Colab if the submission form still requires Colab specifically
125
+ - Required notebook artifact: one public notebook that demonstrates trained-policy behavior against the environment
126
  - Verifier of record: `constellaration.problems.GeometricalProblem`
127
  - Environment style: fresh wiring in this repo, not a port of the old `ai-sci-feasible-designs` harness
128
  - Northflank containers are ephemeral, so persistent storage should be attached before relying on saved models, caches, or fixture data
129
  - Preferred deployment path: push this GitHub repo and let HF Space build from the repo/Docker configuration rather than copying code manually
130
+ - Preferred notebook/HF Space connectivity: make the HF Space public for the hackathon unless privacy becomes necessary; if private, document and use an explicit access token in the notebook (token-setup sketch after this list)
131
 
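For the private-Space fallback, a minimal token-setup sketch for the notebook (an illustration only: it assumes the Space accepts a standard `Authorization: Bearer` header, that the token lives in an `HF_TOKEN` environment variable, and that `/task` is the endpoint the training notebook already queries):

```python
import os

import requests

HF_SPACE_URL = "https://creativeengineer-fusion-design-lab.hf.space"

# Public Space: no auth needed. Private Space: pass an explicit HF access token.
headers = {}
token = os.environ.get("HF_TOKEN")  # set via notebook secrets, never hard-coded
if token:
    headers["Authorization"] = f"Bearer {token}"

# Smoke-check connectivity before wiring the client into any training cell.
resp = requests.get(f"{HF_SPACE_URL}/task", headers=headers, timeout=30)
resp.raise_for_status()
print(resp.json())
```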
132
  ## Immediate Next Steps
133
 
 
139
  - [ ] Save one presentation-ready comparison trace from the refreshed heuristic baseline.
140
  - [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
141
  - [x] Deploy the environment to HF Space.
142
+ - [x] Add the public training notebook under `training/notebooks`.
143
 
144
  These are implementation steps, not another planning phase.
145
 
docs/findings/FUSION_DESIGN_LAB_PLAN_V2.md CHANGED
@@ -42,7 +42,7 @@ Still open:
42
 
43
  - decision on whether reset-seed pool should change from paired checks
44
  - HF Space deployment evidence
45
- - Colab artifact wiring
46
  - demo and README polish after the artifacts are real
47
 
48
  Current caution:
@@ -99,7 +99,7 @@ Use the docs like this:
99
  Visible artifacts:
100
 
101
  - [ ] HF Space environment
102
- - [ ] Required Colab notebook
103
  - [ ] 1-minute demo video
104
  - [x] Public repo and README
105
 
@@ -107,7 +107,7 @@ Compute surfaces:
107
 
108
  - Northflank is the main compute workspace for verifier-heavy work
109
  - HF Space is the hosted environment surface
110
- - Colab is the required public artifact and should show trained-policy behavior against the live environment
111
  - trained-policy work should still iterate on low-fidelity `run`; use high-fidelity `submit` only for sparse checkpoint evaluation and final evidence
112
 
113
  Evidence order:
@@ -145,7 +145,7 @@ The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V
145
  - [x] Refresh the heuristic baseline using the repaired-family evidence.
146
  - [ ] Prove a stable local episode path.
147
  - [ ] Deploy the same task contract to HF Space and prove one clean remote episode.
148
- - [ ] Wire the Colab artifact to the live environment.
149
  - [ ] Record the demo around environment clarity, reward iteration, and baseline evidence.
150
  - [ ] Polish the public repo only after the artifacts above exist.
151
 
@@ -187,7 +187,7 @@ Gate 7: remote surface is real
187
 
188
  Gate 8: submission artifacts exist
189
 
190
- - Colab, demo, and README all reflect the actual environment rather than a hypothetical future one
191
 
192
  ## 10. Fallback Rules
193
 
 
42
 
43
  - decision on whether reset-seed pool should change from paired checks
44
  - HF Space deployment evidence
45
+ - public notebook artifact wiring
46
  - demo and README polish after the artifacts are real
47
 
48
  Current caution:
 
99
  Visible artifacts:
100
 
101
  - [ ] HF Space environment
102
+ - [ ] Public submission notebook
103
  - [ ] 1-minute demo video
104
  - [x] Public repo and README
105
 
 
107
 
108
  - Northflank is the main compute workspace for verifier-heavy work
109
  - HF Space is the hosted environment surface
110
+ - the public notebook artifact should show trained-policy behavior against the live environment and can be mirrored to Colab if the submission form still requires it
111
  - trained-policy work should still iterate on low-fidelity `run`; use high-fidelity `submit` only for sparse checkpoint evaluation and final evidence (cadence sketch after this list)
112
 
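A minimal cadence sketch (illustrative only: it reuses the `StellaratorEnvironment`/`StellaratorAction` API from the training notebook; the import path and the `submit_every` default are assumptions):

```python
from fusion_lab.models import StellaratorAction  # assumed import path


def rollout(env, actions, checkpoint_step: int, submit_every: int = 20) -> float:
    """Cheap low-fidelity `run` steps every iteration; `submit` only sparsely."""
    env.reset(seed=0)
    total = 0.0
    for action in actions:  # low-fidelity inner loop
        obs = env.step(action)
        total += float(obs.reward) if obs.reward is not None else 0.0
        if obs.done:
            return total
    # High-fidelity submit hits the real verifier, so gate it on sparse checkpoints.
    if checkpoint_step % submit_every == 0:
        obs = env.step(StellaratorAction(intent="submit"))
        total += float(obs.reward) if obs.reward is not None else 0.0
    return total
```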
113
  Evidence order:
 
145
  - [x] Refresh the heuristic baseline using the repaired-family evidence.
146
  - [ ] Prove a stable local episode path.
147
  - [ ] Deploy the same task contract to HF Space and prove one clean remote episode.
148
+ - [ ] Wire the public notebook artifact to the live environment.
149
  - [ ] Record the demo around environment clarity, reward iteration, and baseline evidence.
150
  - [ ] Polish the public repo only after the artifacts above exist.
151
 
 
187
 
188
  Gate 8: submission artifacts exist
189
 
190
+ - the public notebook artifact, demo, and README all reflect the actual environment rather than a hypothetical future one
191
 
192
  ## 10. Fallback Rules
193
 
training/notebooks/README.md CHANGED
@@ -4,14 +4,14 @@ Use this directory for the notebooks that support the hackathon submission.
4
 
5
  Expected contents:
6
 
7
- - one Colab-friendly notebook that connects to the deployed HF Space
8
  - one Northflank-friendly notebook path for verifier sanity checks, manual reward iteration, baselines, or training/debugging
9
 
10
  Recommended split:
11
 
12
  - Northflank notebook: main compute workspace on the team H100
13
- - Colab notebook: thin public artifact required by the hackathon
14
- - trained model: required; the Colab notebook should include a trained-policy demonstration even if performance is modest
15
 
16
  ## Status
17
 
@@ -19,7 +19,7 @@ Recommended split:
19
  - [x] runnable Northflank smoke script saved
20
  - [x] Northflank smoke test passed on the team H100
21
  - [ ] manual-playtest notebook or trace notebook saved
22
- - [ ] thin public Colab notebook saved
23
 
24
  Operational defaults:
25
 
@@ -27,7 +27,7 @@ Operational defaults:
27
  - keep heavy verifier and training work on Northflank
28
  - keep low-fidelity `run` as the training inner loop; do not put high-fidelity `submit` in every RL step
29
  - use high-fidelity `submit` only for sparse checkpoint evaluation, paired fixture checks, manual traces, and final evidence
30
- - keep the Colab notebook focused on connecting to the deployed HF Space and exporting visible traces
31
  - prefer a public HF Space for the hackathon; if private, document the token setup directly in the notebook
32
 
33
  Northflank smoke gate:
@@ -43,4 +43,8 @@ Runnable repo path:
43
  - note: `training/notebooks/NORTHFLANK_SMOKE_NOTE.md`
44
  - latest passing artifact example: `/home/jovyan/fusion-design-lab/smoke/northflank_smoke_20260308T023646Z.json`
45
 
46
  The notebooks are supporting evidence for the environment, not the primary product. The required artifact is the notebook plus trained-policy evidence; a standalone checkpoint file is optional only if the notebook can still demonstrate the trained behavior.
 
4
 
5
  Expected contents:
6
 
7
+ - one public notebook artifact that connects to the deployed HF Space; mirror it to Colab if the submission surface requires Colab specifically
8
  - one Northflank-friendly notebook path for verifier sanity checks, manual reward iteration, baselines, or training/debugging
9
 
10
  Recommended split:
11
 
12
  - Northflank notebook: main compute workspace on the team H100
13
+ - public notebook artifact: thin submission surface, mirrored to Colab only if the submission form still requires it
14
+ - trained model: required; the public notebook should include a trained-policy demonstration even if performance is modest
15
 
16
  ## Status
17
 
 
19
  - [x] runnable Northflank smoke script saved
20
  - [x] Northflank smoke test passed on the team H100
21
  - [ ] manual-playtest notebook or trace notebook saved
22
+ - [ ] public submission notebook link saved
23
 
24
  Operational defaults:
25
 
 
27
  - keep heavy verifier and training work on Northflank
28
  - keep low-fidelity `run` as the training inner loop; do not put high-fidelity `submit` in every RL step
29
  - use high-fidelity `submit` only for sparse checkpoint evaluation, paired fixture checks, manual traces, and final evidence
30
+ - keep the public submission notebook focused on connecting to the deployed HF Space and exporting visible traces (trace-export sketch after this list)
31
  - prefer a public HF Space for the hackathon; if private, document the token setup directly in the notebook
32
 
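A minimal trace-export sketch for the public notebook (a sketch under assumptions: it reuses the `FusionLabClient` API and observation fields shown in the training notebook; the action list and output filename are placeholders):

```python
import json

from fusion_lab.client import FusionLabClient
from fusion_lab.models import StellaratorAction

HF_SPACE_URL = "https://creativeengineer-fusion-design-lab.hf.space"

trace = []
with FusionLabClient(base_url=HF_SPACE_URL) as env:
    env.reset(seed=42)
    for action in [StellaratorAction(intent="submit")]:  # replace with the trained plan
        result = env.step(action)
        trace.append(
            {
                "action": action.model_dump(exclude_none=True),
                "reward": float(result.reward) if result.reward is not None else 0.0,
                "p1_score": result.observation.p1_score,
                "feasible": result.observation.constraints_satisfied,
            }
        )
        if result.done:
            break

# The exported file is the visible artifact referenced above.
with open("episode_trace.json", "w") as f:
    json.dump(trace, f, indent=2)
```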
33
  Northflank smoke gate:
 
43
  - note: `training/notebooks/NORTHFLANK_SMOKE_NOTE.md`
44
  - latest passing artifact example: `/home/jovyan/fusion-design-lab/smoke/northflank_smoke_20260308T023646Z.json`
45
 
46
+ LLM notebook helpers should use the packaged prompt/action contract in:
47
+
48
+ - `fusion_lab/llm_agent.py`
49
+
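A minimal usage sketch (hypothetical wiring: it assumes `fusion_lab.llm_agent` exports the `build_prompt` and `parse_action_plan` helpers used in the training notebook; adjust the imports to the module's actual surface):

```python
from fusion_lab.llm_agent import build_prompt, parse_action_plan  # assumed exports


def plan_episode_actions(observation, generate_fn):
    """Observation -> packaged prompt -> LLM completion -> parsed action plan."""
    prompt = build_prompt(observation)    # packaged prompt contract
    completion = generate_fn(prompt)      # any text-generation callable
    return parse_action_plan(completion)  # returns [] on malformed completions
```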
50
  The notebooks are supporting evidence for the environment, not the primary product. The required artifact is the notebook plus trained-policy evidence; a standalone checkpoint file is optional only if the notebook can still demonstrate the trained behavior.
training/notebooks/fusion_design_lab_training.ipynb CHANGED
@@ -87,7 +87,7 @@
87
  "cell_type": "markdown",
88
  "id": "8edb47106e1a46a883d545849b8ab81b",
89
  "metadata": {},
90
- "source": "## 3. Setup Stellarator Environment\n\nInstall the environment package directly from the repository so training runs locally (no network latency per step). The same environment is deployed at the HF Space URL above."
91
  },
92
  {
93
  "cell_type": "code",
@@ -284,13 +284,7 @@
284
  "cell_type": "markdown",
285
  "id": "504fb2a444614c0babb325280ed9130a",
286
  "metadata": {},
287
- "source": [
288
- "## 6. Reward Functions\n",
289
- "\n",
290
- "Two reward signals:\n",
291
- "1. **Format reward**: Does the completion contain a valid JSON action plan?\n",
292
- "2. **Environment reward**: Execute the plan in the stellarator environment and return cumulative reward."
293
- ]
294
  },
295
  {
296
  "cell_type": "code",
@@ -298,72 +292,7 @@
298
  "id": "59bbdb311c014d738909a11f9e486628",
299
  "metadata": {},
300
  "outputs": [],
301
- "source": [
302
- "import traceback\n",
303
- "\n",
304
- "\n",
305
- "def format_reward_fn(completions: list[str], **kwargs) -> list[float]:\n",
306
- " \"\"\"Reward for producing a valid, parseable action plan.\"\"\"\n",
307
- " rewards = []\n",
308
- " for completion in completions:\n",
309
- " actions = parse_action_plan(completion)\n",
310
- " if len(actions) == 0:\n",
311
- " rewards.append(-1.0)\n",
312
- " elif any(a.intent == \"submit\" for a in actions):\n",
313
- " rewards.append(1.0) # valid plan ending with submit\n",
314
- " else:\n",
315
- " rewards.append(0.0) # valid actions but no submit\n",
316
- " return rewards\n",
317
- "\n",
318
- "\n",
319
- "def environment_reward_fn(\n",
320
- " completions: list[str], seed_idx: list[int] | None = None, **kwargs\n",
321
- ") -> list[float]:\n",
322
- " \"\"\"Execute each action plan in the environment and return cumulative reward.\"\"\"\n",
323
- " rewards = []\n",
324
- " seeds = seed_idx if seed_idx is not None else [0] * len(completions)\n",
325
- " for i, completion in enumerate(completions):\n",
326
- " try:\n",
327
- " actions = parse_action_plan(completion)\n",
328
- " if len(actions) == 0:\n",
329
- " rewards.append(-3.0)\n",
330
- " continue\n",
331
- " env = StellaratorEnvironment()\n",
332
- " env.reset(seed=int(seeds[i]) % len(RESET_SEEDS))\n",
333
- " total_reward = 0.0\n",
334
- " for action in actions[:BUDGET]:\n",
335
- " obs = env.step(action)\n",
336
- " total_reward += float(obs.reward or 0.0)\n",
337
- " if obs.done:\n",
338
- " break\n",
339
- " rewards.append(total_reward)\n",
340
- " except Exception:\n",
341
- " traceback.print_exc()\n",
342
- " rewards.append(-3.0)\n",
343
- " return rewards\n",
344
- "\n",
345
- "\n",
346
- "# Test reward functions with a hand-crafted plan\n",
347
- "test_plan = json.dumps(\n",
348
- " [\n",
349
- " {\n",
350
- " \"intent\": \"run\",\n",
351
- " \"parameter\": \"triangularity_scale\",\n",
352
- " \"direction\": \"increase\",\n",
353
- " \"magnitude\": \"small\",\n",
354
- " },\n",
355
- " {\n",
356
- " \"intent\": \"run\",\n",
357
- " \"parameter\": \"rotational_transform\",\n",
358
- " \"direction\": \"increase\",\n",
359
- " \"magnitude\": \"medium\",\n",
360
- " },\n",
361
- " {\"intent\": \"submit\"},\n",
362
- " ]\n",
363
- ")\n",
364
- "print(f\"Format reward: {format_reward_fn([test_plan])}\")\n",
365
- "print(f\"Environment reward: {environment_reward_fn([test_plan], seed_idx=[0])}\")"
366
- ]
367
  },
368
  {
369
  "cell_type": "markdown",
@@ -381,41 +310,7 @@
381
  "id": "8a65eabff63a45729fe45fb5ade58bdc",
382
  "metadata": {},
383
  "outputs": [],
384
- "source": [
385
- "from trl import GRPOConfig, GRPOTrainer\n",
386
- "\n",
387
- "MAX_PROMPT_LENGTH = 768\n",
388
- "MAX_COMPLETION_LENGTH = MAX_SEQ_LENGTH - MAX_PROMPT_LENGTH\n",
389
- "\n",
390
- "training_args = GRPOConfig(\n",
391
- " output_dir=\"./grpo_fusion_output\",\n",
392
- " learning_rate=2e-4,\n",
393
- " num_generations=4,\n",
394
- " max_completion_length=MAX_COMPLETION_LENGTH,\n",
395
- " max_prompt_length=MAX_PROMPT_LENGTH,\n",
396
- " per_device_train_batch_size=4,\n",
397
- " gradient_accumulation_steps=1,\n",
398
- " max_steps=60,\n",
399
- " temperature=1.0,\n",
400
- " logging_steps=1,\n",
401
- " save_steps=20,\n",
402
- " bf16=True,\n",
403
- " report_to=\"none\",\n",
404
- " seed=42,\n",
405
- ")\n",
406
- "\n",
407
- "trainer = GRPOTrainer(\n",
408
- " model=model,\n",
409
- " processing_class=tokenizer,\n",
410
- " reward_funcs=[format_reward_fn, environment_reward_fn],\n",
411
- " args=training_args,\n",
412
- " train_dataset=dataset,\n",
413
- ")\n",
414
- "\n",
415
- "print(\"Starting GRPO training...\")\n",
416
- "train_result = trainer.train()\n",
417
- "print(f\"Training complete. Total steps: {train_result.global_step}\")"
418
- ]
419
  },
420
  {
421
  "cell_type": "markdown",
@@ -518,7 +413,7 @@
518
  " total_reward = 0.0\n",
519
  " for action in actions[:BUDGET]:\n",
520
  " obs = env.step(action)\n",
521
- " r = float(obs.reward or 0.0)\n",
522
  " total_reward += r\n",
523
  " trace.append(\n",
524
  " f\" {action.intent} {action.parameter or ''} {action.direction or ''} {action.magnitude or ''} → reward={r:.3f} score={obs.p1_score:.4f} feasible={obs.constraints_satisfied}\".strip()\n",
@@ -537,12 +432,12 @@
537
  " spec = random.choice(AVAILABLE_ACTIONS[:24]) # run actions only\n",
538
  " action = StellaratorAction(**spec)\n",
539
  " obs = env.step(action)\n",
540
- " total_reward += float(obs.reward or 0.0)\n",
541
  " if obs.done:\n",
542
  " return total_reward\n",
543
  " # submit on last step\n",
544
  " obs = env.step(StellaratorAction(intent=\"submit\"))\n",
545
- " total_reward += float(obs.reward or 0.0)\n",
546
  " return total_reward\n",
547
  "\n",
548
  "\n",
@@ -579,7 +474,7 @@
579
  "cell_type": "markdown",
580
  "id": "cb1e1581032b452c9409d6c6813c49d1",
581
  "metadata": {},
582
- "source": "## 10. Connect to Deployed HF Space\n\nDemonstrate connecting to the live environment on Hugging Face Spaces and running the trained model against it."
583
  },
584
  {
585
  "cell_type": "code",
@@ -590,7 +485,7 @@
590
  "source": [
591
  "import requests\n",
592
  "\n",
593
- "from fusion_lab.models import StellaratorObservation\n",
594
  "\n",
595
  "HF_SPACE_URL = \"https://creativeengineer-fusion-design-lab.hf.space\"\n",
596
  "\n",
@@ -604,39 +499,38 @@
604
  "print(f\"Constraints: {task['constraints']}\")\n",
605
  "print(f\"Budget: {task['budget']}\")\n",
606
  "\n",
607
- "# Reset an episode on the remote environment\n",
608
- "resp = requests.post(f\"{HF_SPACE_URL}/reset\", json={\"seed\": 42}).json()\n",
609
- "obs_data = resp[\"observation\"]\n",
610
- "print(f\"\\nRemote reset — max_elongation: {obs_data['max_elongation']:.4f}\")\n",
611
- "print(f\" aspect_ratio: {obs_data['aspect_ratio']:.4f}\")\n",
612
- "print(f\" constraints_satisfied: {obs_data['constraints_satisfied']}\")\n",
613
- "print(f\" budget_remaining: {obs_data['budget_remaining']}\")\n",
614
- "\n",
615
- "# Generate an action plan from the trained model\n",
616
- "remote_obs = StellaratorObservation.model_validate(obs_data)\n",
617
- "prompt = build_prompt(remote_obs)\n",
618
- "inputs = tokenizer(prompt, return_tensors=\"pt\").to(model.device)\n",
619
- "outputs = model.generate(\n",
620
- " **inputs, max_new_tokens=MAX_COMPLETION_LENGTH, temperature=0.7, do_sample=True\n",
621
- ")\n",
622
- "completion = tokenizer.decode(outputs[0][inputs[\"input_ids\"].shape[1] :], skip_special_tokens=True)\n",
623
- "actions = parse_action_plan(completion)\n",
624
- "\n",
625
- "print(f\"\\nTrained model generated {len(actions)} actions for remote env:\")\n",
626
- "for i, action in enumerate(actions[:BUDGET]):\n",
627
- " action_payload = action.model_dump(exclude_none=True)\n",
628
- " step_resp = requests.post(f\"{HF_SPACE_URL}/step\", json={\"action\": action_payload}).json()\n",
629
- " r = step_resp.get(\"reward\", 0)\n",
630
- " done = step_resp.get(\"done\", False)\n",
631
- " step_obs = step_resp[\"observation\"]\n",
632
- " print(\n",
633
- " f\" Step {i + 1}: {action.intent} {action.parameter or ''} \"\n",
634
- " f\"{action.direction or ''} {action.magnitude or ''} \"\n",
635
- " f\"→ reward={r:.3f}, score={step_obs['p1_score']:.4f}\"\n",
636
  " )\n",
637
- " if done:\n",
638
- " print(f\" Episode done. Final score: {step_obs['p1_score']:.4f}\")\n",
639
- " break\n",
640
  "\n",
641
  "print(\"\\nEnvironment is live and accessible for training and evaluation.\")"
642
  ]
 
87
  "cell_type": "markdown",
88
  "id": "8edb47106e1a46a883d545849b8ab81b",
89
  "metadata": {},
90
+ "source": "## 3. Setup Stellarator Environment\n\nInstall the environment package directly from the HF Space repository so training runs locally (no network latency per step). The package also includes the typed `FusionLabClient` and Pydantic models for remote OpenEnv sessions."
91
  },
92
  {
93
  "cell_type": "code",
 
284
  "cell_type": "markdown",
285
  "id": "504fb2a444614c0babb325280ed9130a",
286
  "metadata": {},
287
+ "source": "## 6. Reward Functions\n\nTwo reward signals for GRPO:\n1. **Format reward**: Is the completion a valid JSON action plan?\n2. **Environment reward**: Execute the plan in the stellarator environment and return the cumulative reward. The environment's built-in reward already decomposes feasibility (+3/−3 crossing bonuses, feasibility progress), objective (max elongation improvement), step costs, submit bonuses, and failure penalties — see `server/environment.py:_compute_reward`."
288
  },
289
  {
290
  "cell_type": "code",
 
292
  "id": "59bbdb311c014d738909a11f9e486628",
293
  "metadata": {},
294
  "outputs": [],
295
+ "source": "import traceback\n\n\ndef format_reward_fn(completions: list[str], **kwargs) -> list[float]:\n \"\"\"Reward for producing a valid, parseable action plan.\"\"\"\n rewards = []\n for completion in completions:\n actions = parse_action_plan(completion)\n if len(actions) == 0:\n rewards.append(-1.0)\n elif any(a.intent == \"submit\" for a in actions):\n rewards.append(1.0)\n else:\n rewards.append(0.0)\n return rewards\n\n\ndef environment_reward_fn(\n completions: list[str], seed_idx: list[int] | None = None, **kwargs\n) -> list[float]:\n \"\"\"Execute each action plan in the environment and return cumulative reward.\n\n The environment's _compute_reward already includes:\n - Feasibility crossing bonuses (+3/-3)\n - Infeasible progress: 5.0 * delta_feasibility\n - Feasible improvement: 10.0 * delta_max_elongation\n - Submit improvement bonus: 5.0 * ratio + budget_fraction\n - Step cost (-0.1), failure penalties, recovery bonuses\n \"\"\"\n rewards = []\n seeds = seed_idx if seed_idx is not None else [0] * len(completions)\n for i, completion in enumerate(completions):\n try:\n actions = parse_action_plan(completion)\n if len(actions) == 0:\n rewards.append(-3.0)\n continue\n env = StellaratorEnvironment()\n env.reset(seed=int(seeds[i]) % len(RESET_SEEDS))\n total_reward = 0.0\n for action in actions[:BUDGET]:\n obs = env.step(action)\n total_reward += float(obs.reward) if obs.reward is not None else 0.0\n if obs.done:\n break\n rewards.append(total_reward)\n except Exception:\n traceback.print_exc()\n rewards.append(-3.0)\n return rewards\n\n\n# Test reward functions with a hand-crafted plan\ntest_plan = json.dumps(\n [\n {\n \"intent\": \"run\",\n \"parameter\": \"triangularity_scale\",\n \"direction\": \"increase\",\n \"magnitude\": \"small\",\n },\n {\n \"intent\": \"run\",\n \"parameter\": \"rotational_transform\",\n \"direction\": \"increase\",\n \"magnitude\": \"medium\",\n },\n {\"intent\": \"submit\"},\n ]\n)\nprint(f\"Format reward: {format_reward_fn([test_plan])}\")\nprint(f\"Environment reward: {environment_reward_fn([test_plan], seed_idx=[0])}\")"
296
  },
297
  {
298
  "cell_type": "markdown",
 
310
  "id": "8a65eabff63a45729fe45fb5ade58bdc",
311
  "metadata": {},
312
  "outputs": [],
313
+ "source": "from trl import GRPOConfig, GRPOTrainer\n\nMAX_PROMPT_LENGTH = 768\nMAX_COMPLETION_LENGTH = MAX_SEQ_LENGTH - MAX_PROMPT_LENGTH\n\ntraining_args = GRPOConfig(\n output_dir=\"./grpo_fusion_output\",\n learning_rate=2e-4,\n num_generations=4,\n max_completion_length=MAX_COMPLETION_LENGTH,\n max_prompt_length=MAX_PROMPT_LENGTH,\n per_device_train_batch_size=4,\n gradient_accumulation_steps=1,\n max_steps=60,\n temperature=1.0,\n logging_steps=1,\n save_steps=20,\n bf16=True,\n report_to=\"none\",\n seed=42,\n)\n\ntrainer = GRPOTrainer(\n model=model,\n processing_class=tokenizer,\n reward_funcs=[format_reward_fn, environment_reward_fn],\n args=training_args,\n train_dataset=dataset,\n)\n\nprint(\"Starting GRPO training...\")\ntrain_result = trainer.train()\nprint(f\"Training complete. Total steps: {train_result.global_step}\")"
314
  },
315
  {
316
  "cell_type": "markdown",
 
413
  " total_reward = 0.0\n",
414
  " for action in actions[:BUDGET]:\n",
415
  " obs = env.step(action)\n",
416
+ " r = float(obs.reward) if obs.reward is not None else 0.0\n",
417
  " total_reward += r\n",
418
  " trace.append(\n",
419
  " f\" {action.intent} {action.parameter or ''} {action.direction or ''} {action.magnitude or ''} → reward={r:.3f} score={obs.p1_score:.4f} feasible={obs.constraints_satisfied}\".strip()\n",
 
432
  " spec = random.choice(AVAILABLE_ACTIONS[:24]) # run actions only\n",
433
  " action = StellaratorAction(**spec)\n",
434
  " obs = env.step(action)\n",
435
+ " total_reward += float(obs.reward) if obs.reward is not None else 0.0\n",
436
  " if obs.done:\n",
437
  " return total_reward\n",
438
  " # submit on last step\n",
439
  " obs = env.step(StellaratorAction(intent=\"submit\"))\n",
440
+ " total_reward += float(obs.reward) if obs.reward is not None else 0.0\n",
441
  " return total_reward\n",
442
  "\n",
443
  "\n",
 
474
  "cell_type": "markdown",
475
  "id": "cb1e1581032b452c9409d6c6813c49d1",
476
  "metadata": {},
477
+ "source": "## 10. Connect to Deployed HF Space\n\nDemonstrate connecting to the live environment on Hugging Face Spaces through the typed OpenEnv client and running the trained model against it."
478
  },
479
  {
480
  "cell_type": "code",
 
485
  "source": [
486
  "import requests\n",
487
  "\n",
488
+ "from fusion_lab.client import FusionLabClient\n",
489
  "\n",
490
  "HF_SPACE_URL = \"https://creativeengineer-fusion-design-lab.hf.space\"\n",
491
  "\n",
 
499
  "print(f\"Constraints: {task['constraints']}\")\n",
500
  "print(f\"Budget: {task['budget']}\")\n",
501
  "\n",
502
+ "with FusionLabClient(base_url=HF_SPACE_URL) as env:\n",
503
+ " reset_result = env.reset(seed=42)\n",
504
+ " remote_obs = reset_result.observation\n",
505
+ " print(f\"\\nRemote reset — max_elongation: {remote_obs.max_elongation:.4f}\")\n",
506
+ " print(f\" aspect_ratio: {remote_obs.aspect_ratio:.4f}\")\n",
507
+ " print(f\" constraints_satisfied: {remote_obs.constraints_satisfied}\")\n",
508
+ " print(f\" budget_remaining: {remote_obs.budget_remaining}\")\n",
509
+ "\n",
510
+ " # Generate an action plan from the trained model\n",
511
+ " prompt = build_prompt(remote_obs)\n",
512
+ " inputs = tokenizer(prompt, return_tensors=\"pt\").to(model.device)\n",
513
+ " outputs = model.generate(\n",
514
+ " **inputs, max_new_tokens=MAX_COMPLETION_LENGTH, temperature=0.7, do_sample=True\n",
515
+ " )\n",
516
+ " completion = tokenizer.decode(\n",
517
+ " outputs[0][inputs[\"input_ids\"].shape[1] :], skip_special_tokens=True\n",
 
518
  " )\n",
519
+ " actions = parse_action_plan(completion)\n",
520
+ "\n",
521
+ " print(f\"\\nTrained model generated {len(actions)} actions for remote env:\")\n",
522
+ " for i, action in enumerate(actions[:BUDGET]):\n",
523
+ " result = env.step(action)\n",
524
+ " step_obs = result.observation\n",
525
+ " reward = float(result.reward) if result.reward is not None else 0.0\n",
526
+ " print(\n",
527
+ " f\" Step {i + 1}: {action.intent} {action.parameter or ''} \"\n",
528
+ " f\"{action.direction or ''} {action.magnitude or ''} \"\n",
529
+ " f\"→ reward={reward:.3f}, score={step_obs.p1_score:.4f}\"\n",
530
+ " )\n",
531
+ " if result.done:\n",
532
+ " print(f\" Episode done. Final score: {step_obs.p1_score:.4f}\")\n",
533
+ " break\n",
534
  "\n",
535
  "print(\"\\nEnvironment is live and accessible for training and evaluation.\")"
536
  ]