CreativeEngineer committed on
Commit 2348d3e · 1 Parent(s): 9827b11

docs: clarify notebook surfaces and OpenEnv guidance

README.md CHANGED
@@ -3,7 +3,7 @@
3
  Fusion Design Lab is an environment-first [OpenEnv](https://openenv.dev) hackathon project for the `P1` stellarator benchmark.
4
 
5
  **Live Environment**: [HF Space](https://huggingface.co/spaces/CreativeEngineer/fusion-design-lab)
6
- **Training Notebook**: [Colab (GRPO + Unsloth)](training/notebooks/fusion_design_lab_training.ipynb)
7
 
8
  ## What It Does
9
 
@@ -57,7 +57,7 @@ The environment uses [`constellaration`](https://pypi.org/project/constellaratio
57
  - [x] Complete paired high-fidelity fixture checks and at least one real submit-side manual trace before any broader training push
58
  - [x] Refresh the heuristic baseline for the real verifier path
59
  - [x] Deploy the real environment to HF Space
60
- - [x] Add the Colab training notebook under `training/notebooks`
61
 
62
  ## Known Gaps
63
 
@@ -121,13 +121,13 @@ uv sync --extra notebooks
121
 
122
  - Recommended compute workspace: Northflank Jupyter Notebook with PyTorch on the team H100
123
  - OpenEnv deployment target: Hugging Face Spaces
124
- - Minimal submission notebook target: Colab
125
- - Required notebook artifact: one public Colab notebook that demonstrates trained-policy behavior against the environment
126
  - Verifier of record: `constellaration.problems.GeometricalProblem`
127
  - Environment style: fresh wiring in this repo, not a port of the old `ai-sci-feasible-designs` harness
128
  - Northflank containers are ephemeral, so persistent storage should be attached before relying on saved models, caches, or fixture data
129
  - Preferred deployment path: push this GitHub repo and let HF Space build from the repo/Docker configuration rather than copying code manually
130
- - Preferred Colab/HF Space connectivity: make the HF Space public for the hackathon unless privacy becomes necessary; if private, document and use an explicit access token in the notebook
131
 
132
  ## Immediate Next Steps
133
 
@@ -139,7 +139,7 @@ uv sync --extra notebooks
139
  - [ ] Save one presentation-ready comparison trace from the refreshed heuristic baseline.
140
  - [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
141
  - [x] Deploy the environment to HF Space.
142
- - [x] Add the Colab notebook under `training/notebooks`.
143
 
144
  These are implementation steps, not another planning phase.
145
 
 
3
  Fusion Design Lab is an environment-first [OpenEnv](https://openenv.dev) hackathon project for the `P1` stellarator benchmark.
4
 
5
  **Live Environment**: [HF Space](https://huggingface.co/spaces/CreativeEngineer/fusion-design-lab)
6
+ **Training Notebook**: [Repository Notebook (GRPO + Unsloth)](training/notebooks/fusion_design_lab_training.ipynb)
7
 
8
  ## What It Does
9
 
 
57
  - [x] Complete paired high-fidelity fixture checks and at least one real submit-side manual trace before any broader training push
58
  - [x] Refresh the heuristic baseline for the real verifier path
59
  - [x] Deploy the real environment to HF Space
60
+ - [x] Add the public training notebook under `training/notebooks`
61
 
62
  ## Known Gaps
63
 
 
121
 
122
  - Recommended compute workspace: Northflank Jupyter Notebook with PyTorch on the team H100
123
  - OpenEnv deployment target: Hugging Face Spaces
124
+ - Submission notebook surface: one public notebook artifact; mirror it to Colab if the submission form still requires Colab specifically
125
+ - Required notebook artifact: one public notebook that demonstrates trained-policy behavior against the environment
126
  - Verifier of record: `constellaration.problems.GeometricalProblem`
127
  - Environment style: fresh wiring in this repo, not a port of the old `ai-sci-feasible-designs` harness
128
  - Northflank containers are ephemeral, so persistent storage should be attached before relying on saved models, caches, or fixture data
129
  - Preferred deployment path: push this GitHub repo and let HF Space build from the repo/Docker configuration rather than copying code manually
130
+ - Preferred notebook/HF Space connectivity: make the HF Space public for the hackathon unless privacy becomes necessary; if private, document and use an explicit access token in the notebook (token-setup sketch after this list)
131
 
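For the private-Space fallback, a minimal token-setup sketch for the notebook (an illustration only: it assumes the Space accepts a standard `Authorization: Bearer` header, that the token lives in an `HF_TOKEN` environment variable, and that `/task` is the endpoint the training notebook already queries):

```python
import os

import requests

HF_SPACE_URL = "https://creativeengineer-fusion-design-lab.hf.space"

# Public Space: no auth needed. Private Space: pass an explicit HF access token.
headers = {}
token = os.environ.get("HF_TOKEN")  # set via notebook secrets, never hard-coded
if token:
    headers["Authorization"] = f"Bearer {token}"

# Smoke-check connectivity before wiring the client into any training cell.
resp = requests.get(f"{HF_SPACE_URL}/task", headers=headers, timeout=30)
resp.raise_for_status()
print(resp.json())
```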
132
  ## Immediate Next Steps
133
 
 
139
  - [ ] Save one presentation-ready comparison trace from the refreshed heuristic baseline.
140
  - [ ] Use the passing Northflank H100 setup to produce remote traces and comparisons from the real verifier path.
141
  - [x] Deploy the environment to HF Space.
142
+ - [x] Add the public training notebook under `training/notebooks`.
143
 
144
  These are implementation steps, not another planning phase.
145
 
docs/findings/FUSION_DESIGN_LAB_PLAN_V2.md CHANGED
@@ -42,7 +42,7 @@ Still open:
42
 
43
  - decision on whether reset-seed pool should change from paired checks
44
  - HF Space deployment evidence
45
- - Colab artifact wiring
46
  - demo and README polish after the artifacts are real
47
 
48
  Current caution:
@@ -99,7 +99,7 @@ Use the docs like this:
99
  Visible artifacts:
100
 
101
  - [ ] HF Space environment
102
- - [ ] Required Colab notebook
103
  - [ ] 1-minute demo video
104
  - [x] Public repo and README
105
 
@@ -107,7 +107,7 @@ Compute surfaces:
107
 
108
  - Northflank is the main compute workspace for verifier-heavy work
109
  - HF Space is the hosted environment surface
110
- - Colab is the required public artifact and should show trained-policy behavior against the live environment
111
  - trained-policy work should still iterate on low-fidelity `run`; use high-fidelity `submit` only for sparse checkpoint evaluation and final evidence
112
 
113
  Evidence order:
@@ -145,7 +145,7 @@ The live technical details belong in [`P1_ENV_CONTRACT_V1.md`](P1_ENV_CONTRACT_V
145
  - [x] Refresh the heuristic baseline using the repaired-family evidence.
146
  - [ ] Prove a stable local episode path.
147
  - [ ] Deploy the same task contract to HF Space and prove one clean remote episode.
148
- - [ ] Wire the Colab artifact to the live environment.
149
  - [ ] Record the demo around environment clarity, reward iteration, and baseline evidence.
150
  - [ ] Polish the public repo only after the artifacts above exist.
151
 
@@ -187,7 +187,7 @@ Gate 7: remote surface is real
187
 
188
  Gate 8: submission artifacts exist
189
 
190
- - Colab, demo, and README all reflect the actual environment rather than a hypothetical future one
191
 
192
  ## 10. Fallback Rules
193
 
 
42
 
43
  - decision on whether reset-seed pool should change from paired checks
44
  - HF Space deployment evidence
45
+ - public notebook artifact wiring
46
  - demo and README polish after the artifacts are real
47
 
48
  Current caution:
 
99
  Visible artifacts:
100
 
101
  - [ ] HF Space environment
102
+ - [ ] Public submission notebook
103
  - [ ] 1-minute demo video
104
  - [x] Public repo and README
105
 
 
107
 
108
  - Northflank is the main compute workspace for verifier-heavy work
109
  - HF Space is the hosted environment surface
110
+ - the public notebook artifact should show trained-policy behavior against the live environment and can be mirrored to Colab if the submission form still requires it
111
  - trained-policy work should still iterate on low-fidelity `run`; use high-fidelity `submit` only for sparse checkpoint evaluation and final evidence (cadence sketch after this list)
112
 
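A minimal cadence sketch (illustrative only: it reuses the `StellaratorEnvironment`/`StellaratorAction` API from the training notebook; the import path and the `submit_every` default are assumptions):

```python
from fusion_lab.models import StellaratorAction  # assumed import path


def rollout(env, actions, checkpoint_step: int, submit_every: int = 20) -> float:
    """Cheap low-fidelity `run` steps every iteration; `submit` only sparsely."""
    env.reset(seed=0)
    total = 0.0
    for action in actions:  # low-fidelity inner loop
        obs = env.step(action)
        total += float(obs.reward) if obs.reward is not None else 0.0
        if obs.done:
            return total
    # High-fidelity submit hits the real verifier, so gate it on sparse checkpoints.
    if checkpoint_step % submit_every == 0:
        obs = env.step(StellaratorAction(intent="submit"))
        total += float(obs.reward) if obs.reward is not None else 0.0
    return total
```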
113
  Evidence order:
 
145
  - [x] Refresh the heuristic baseline using the repaired-family evidence.
146
  - [ ] Prove a stable local episode path.
147
  - [ ] Deploy the same task contract to HF Space and prove one clean remote episode.
148
+ - [ ] Wire the public notebook artifact to the live environment.
149
  - [ ] Record the demo around environment clarity, reward iteration, and baseline evidence.
150
  - [ ] Polish the public repo only after the artifacts above exist.
151
 
 
187
 
188
  Gate 8: submission artifacts exist
189
 
190
+ - the public notebook artifact, demo, and README all reflect the actual environment rather than a hypothetical future one
191
 
192
  ## 10. Fallback Rules
193
 
training/notebooks/README.md CHANGED
@@ -4,14 +4,14 @@ Use this directory for the notebooks that support the hackathon submission.
4
 
5
  Expected contents:
6
 
7
- - one Colab-friendly notebook that connects to the deployed HF Space
8
  - one Northflank-friendly notebook path for verifier sanity checks, manual reward iteration, baselines, or training/debugging
9
 
10
  Recommended split:
11
 
12
  - Northflank notebook: main compute workspace on the team H100
13
- - Colab notebook: thin public artifact required by the hackathon
14
- - trained model: required; the Colab notebook should include a trained-policy demonstration even if performance is modest
15
 
16
  ## Status
17
 
@@ -19,7 +19,7 @@ Recommended split:
19
  - [x] runnable Northflank smoke script saved
20
  - [x] Northflank smoke test passed on the team H100
21
  - [ ] manual-playtest notebook or trace notebook saved
22
- - [ ] thin public Colab notebook saved
23
 
24
  Operational defaults:
25
 
@@ -27,7 +27,7 @@ Operational defaults:
27
  - keep heavy verifier and training work on Northflank
28
  - keep low-fidelity `run` as the training inner loop; do not put high-fidelity `submit` in every RL step
29
  - use high-fidelity `submit` only for sparse checkpoint evaluation, paired fixture checks, manual traces, and final evidence
30
- - keep the Colab notebook focused on connecting to the deployed HF Space and exporting visible traces
31
  - prefer a public HF Space for the hackathon; if private, document the token setup directly in the notebook
32
 
33
  Northflank smoke gate:
@@ -43,4 +43,8 @@ Runnable repo path:
43
  - note: `training/notebooks/NORTHFLANK_SMOKE_NOTE.md`
44
  - latest passing artifact example: `/home/jovyan/fusion-design-lab/smoke/northflank_smoke_20260308T023646Z.json`
45
 
46
  The notebooks are supporting evidence for the environment, not the primary product. The required artifact is the notebook plus trained-policy evidence; a standalone checkpoint file is optional only if the notebook can still demonstrate the trained behavior.
 
4
 
5
  Expected contents:
6
 
7
+ - one public notebook artifact that connects to the deployed HF Space; mirror it to Colab if the submission surface requires Colab specifically
8
  - one Northflank-friendly notebook path for verifier sanity checks, manual reward iteration, baselines, or training/debugging
9
 
10
  Recommended split:
11
 
12
  - Northflank notebook: main compute workspace on the team H100
13
+ - public notebook artifact: thin submission surface, mirrored to Colab only if the submission form still requires it
14
+ - trained model: required; the public notebook should include a trained-policy demonstration even if performance is modest
15
 
16
  ## Status
17
 
 
19
  - [x] runnable Northflank smoke script saved
20
  - [x] Northflank smoke test passed on the team H100
21
  - [ ] manual-playtest notebook or trace notebook saved
22
+ - [ ] public submission notebook link saved
23
 
24
  Operational defaults:
25
 
 
27
  - keep heavy verifier and training work on Northflank
28
  - keep low-fidelity `run` as the training inner loop; do not put high-fidelity `submit` in every RL step
29
  - use high-fidelity `submit` only for sparse checkpoint evaluation, paired fixture checks, manual traces, and final evidence
30
+ - keep the public submission notebook focused on connecting to the deployed HF Space and exporting visible traces (trace-export sketch after this list)
31
  - prefer a public HF Space for the hackathon; if private, document the token setup directly in the notebook
32
 
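A minimal trace-export sketch for the public notebook (a sketch under assumptions: it reuses the `FusionLabClient` API and observation fields shown in the training notebook; the action list and output filename are placeholders):

```python
import json

from fusion_lab.client import FusionLabClient
from fusion_lab.models import StellaratorAction

HF_SPACE_URL = "https://creativeengineer-fusion-design-lab.hf.space"

trace = []
with FusionLabClient(base_url=HF_SPACE_URL) as env:
    env.reset(seed=42)
    for action in [StellaratorAction(intent="submit")]:  # replace with the trained plan
        result = env.step(action)
        trace.append(
            {
                "action": action.model_dump(exclude_none=True),
                "reward": float(result.reward) if result.reward is not None else 0.0,
                "p1_score": result.observation.p1_score,
                "feasible": result.observation.constraints_satisfied,
            }
        )
        if result.done:
            break

# The exported file is the visible artifact referenced above.
with open("episode_trace.json", "w") as f:
    json.dump(trace, f, indent=2)
```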
33
  Northflank smoke gate:
 
43
  - note: `training/notebooks/NORTHFLANK_SMOKE_NOTE.md`
44
  - latest passing artifact example: `/home/jovyan/fusion-design-lab/smoke/northflank_smoke_20260308T023646Z.json`
45
 
46
+ LLM notebook helpers should use the packaged prompt/action contract in:
47
+
48
+ - `fusion_lab/llm_agent.py`
49
+
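A minimal usage sketch (hypothetical wiring: it assumes `fusion_lab.llm_agent` exports the `build_prompt` and `parse_action_plan` helpers used in the training notebook; adjust the imports to the module's actual surface):

```python
from fusion_lab.llm_agent import build_prompt, parse_action_plan  # assumed exports


def plan_episode_actions(observation, generate_fn):
    """Observation -> packaged prompt -> LLM completion -> parsed action plan."""
    prompt = build_prompt(observation)    # packaged prompt contract
    completion = generate_fn(prompt)      # any text-generation callable
    return parse_action_plan(completion)  # returns [] on malformed completions
```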
50
  The notebooks are supporting evidence for the environment, not the primary product. The required artifact is the notebook plus trained-policy evidence; a standalone checkpoint file is optional only if the notebook can still demonstrate the trained behavior.
training/notebooks/fusion_design_lab_training.ipynb CHANGED
@@ -87,7 +87,7 @@
87
  "cell_type": "markdown",
88
  "id": "8edb47106e1a46a883d545849b8ab81b",
89
  "metadata": {},
90
- "source": "## 3. Setup Stellarator Environment\n\nInstall the environment package directly from the repository so training runs locally (no network latency per step). The same environment is deployed at the HF Space URL above."
91
  },
92
  {
93
  "cell_type": "code",
@@ -284,13 +284,7 @@
284
  "cell_type": "markdown",
285
  "id": "504fb2a444614c0babb325280ed9130a",
286
  "metadata": {},
287
- "source": [
288
- "## 6. Reward Functions\n",
289
- "\n",
290
- "Two reward signals:\n",
291
- "1. **Format reward**: Does the completion contain a valid JSON action plan?\n",
292
- "2. **Environment reward**: Execute the plan in the stellarator environment and return cumulative reward."
293
- ]
294
  },
295
  {
296
  "cell_type": "code",
@@ -298,72 +292,7 @@
298
  "id": "59bbdb311c014d738909a11f9e486628",
299
  "metadata": {},
300
  "outputs": [],
301
- "source": [
302
- "import traceback\n",
303
- "\n",
304
- "\n",
305
- "def format_reward_fn(completions: list[str], **kwargs) -> list[float]:\n",
306
- " \"\"\"Reward for producing a valid, parseable action plan.\"\"\"\n",
307
- " rewards = []\n",
308
- " for completion in completions:\n",
309
- " actions = parse_action_plan(completion)\n",
310
- " if len(actions) == 0:\n",
311
- " rewards.append(-1.0)\n",
312
- " elif any(a.intent == \"submit\" for a in actions):\n",
313
- " rewards.append(1.0) # valid plan ending with submit\n",
314
- " else:\n",
315
- " rewards.append(0.0) # valid actions but no submit\n",
316
- " return rewards\n",
317
- "\n",
318
- "\n",
319
- "def environment_reward_fn(\n",
320
- " completions: list[str], seed_idx: list[int] | None = None, **kwargs\n",
321
- ") -> list[float]:\n",
322
- " \"\"\"Execute each action plan in the environment and return cumulative reward.\"\"\"\n",
323
- " rewards = []\n",
324
- " seeds = seed_idx if seed_idx is not None else [0] * len(completions)\n",
325
- " for i, completion in enumerate(completions):\n",
326
- " try:\n",
327
- " actions = parse_action_plan(completion)\n",
328
- " if len(actions) == 0:\n",
329
- " rewards.append(-3.0)\n",
330
- " continue\n",
331
- " env = StellaratorEnvironment()\n",
332
- " env.reset(seed=int(seeds[i]) % len(RESET_SEEDS))\n",
333
- " total_reward = 0.0\n",
334
- " for action in actions[:BUDGET]:\n",
335
- " obs = env.step(action)\n",
336
- " total_reward += float(obs.reward or 0.0)\n",
337
- " if obs.done:\n",
338
- " break\n",
339
- " rewards.append(total_reward)\n",
340
- " except Exception:\n",
341
- " traceback.print_exc()\n",
342
- " rewards.append(-3.0)\n",
343
- " return rewards\n",
344
- "\n",
345
- "\n",
346
- "# Test reward functions with a hand-crafted plan\n",
347
- "test_plan = json.dumps(\n",
348
- " [\n",
349
- " {\n",
350
- " \"intent\": \"run\",\n",
351
- " \"parameter\": \"triangularity_scale\",\n",
352
- " \"direction\": \"increase\",\n",
353
- " \"magnitude\": \"small\",\n",
354
- " },\n",
355
- " {\n",
356
- " \"intent\": \"run\",\n",
357
- " \"parameter\": \"rotational_transform\",\n",
358
- " \"direction\": \"increase\",\n",
359
- " \"magnitude\": \"medium\",\n",
360
- " },\n",
361
- " {\"intent\": \"submit\"},\n",
362
- " ]\n",
363
- ")\n",
364
- "print(f\"Format reward: {format_reward_fn([test_plan])}\")\n",
365
- "print(f\"Environment reward: {environment_reward_fn([test_plan], seed_idx=[0])}\")"
366
- ]
367
  },
368
  {
369
  "cell_type": "markdown",
@@ -381,41 +310,7 @@
381
  "id": "8a65eabff63a45729fe45fb5ade58bdc",
382
  "metadata": {},
383
  "outputs": [],
384
- "source": [
385
- "from trl import GRPOConfig, GRPOTrainer\n",
386
- "\n",
387
- "MAX_PROMPT_LENGTH = 768\n",
388
- "MAX_COMPLETION_LENGTH = MAX_SEQ_LENGTH - MAX_PROMPT_LENGTH\n",
389
- "\n",
390
- "training_args = GRPOConfig(\n",
391
- " output_dir=\"./grpo_fusion_output\",\n",
392
- " learning_rate=2e-4,\n",
393
- " num_generations=4,\n",
394
- " max_completion_length=MAX_COMPLETION_LENGTH,\n",
395
- " max_prompt_length=MAX_PROMPT_LENGTH,\n",
396
- " per_device_train_batch_size=4,\n",
397
- " gradient_accumulation_steps=1,\n",
398
- " max_steps=60,\n",
399
- " temperature=1.0,\n",
400
- " logging_steps=1,\n",
401
- " save_steps=20,\n",
402
- " bf16=True,\n",
403
- " report_to=\"none\",\n",
404
- " seed=42,\n",
405
- ")\n",
406
- "\n",
407
- "trainer = GRPOTrainer(\n",
408
- " model=model,\n",
409
- " processing_class=tokenizer,\n",
410
- " reward_funcs=[format_reward_fn, environment_reward_fn],\n",
411
- " args=training_args,\n",
412
- " train_dataset=dataset,\n",
413
- ")\n",
414
- "\n",
415
- "print(\"Starting GRPO training...\")\n",
416
- "train_result = trainer.train()\n",
417
- "print(f\"Training complete. Total steps: {train_result.global_step}\")"
418
- ]
419
  },
420
  {
421
  "cell_type": "markdown",
@@ -518,7 +413,7 @@
518
  " total_reward = 0.0\n",
519
  " for action in actions[:BUDGET]:\n",
520
  " obs = env.step(action)\n",
521
- " r = float(obs.reward or 0.0)\n",
522
  " total_reward += r\n",
523
  " trace.append(\n",
524
  " f\" {action.intent} {action.parameter or ''} {action.direction or ''} {action.magnitude or ''} → reward={r:.3f} score={obs.p1_score:.4f} feasible={obs.constraints_satisfied}\".strip()\n",
@@ -537,12 +432,12 @@
537
  " spec = random.choice(AVAILABLE_ACTIONS[:24]) # run actions only\n",
538
  " action = StellaratorAction(**spec)\n",
539
  " obs = env.step(action)\n",
540
- " total_reward += float(obs.reward or 0.0)\n",
541
  " if obs.done:\n",
542
  " return total_reward\n",
543
  " # submit on last step\n",
544
  " obs = env.step(StellaratorAction(intent=\"submit\"))\n",
545
- " total_reward += float(obs.reward or 0.0)\n",
546
  " return total_reward\n",
547
  "\n",
548
  "\n",
@@ -579,7 +474,7 @@
579
  "cell_type": "markdown",
580
  "id": "cb1e1581032b452c9409d6c6813c49d1",
581
  "metadata": {},
582
- "source": "## 10. Connect to Deployed HF Space\n\nDemonstrate connecting to the live environment on Hugging Face Spaces and running the trained model against it."
583
  },
584
  {
585
  "cell_type": "code",
@@ -590,7 +485,7 @@
590
  "source": [
591
  "import requests\n",
592
  "\n",
593
- "from fusion_lab.models import StellaratorObservation\n",
594
  "\n",
595
  "HF_SPACE_URL = \"https://creativeengineer-fusion-design-lab.hf.space\"\n",
596
  "\n",
@@ -604,39 +499,38 @@
604
  "print(f\"Constraints: {task['constraints']}\")\n",
605
  "print(f\"Budget: {task['budget']}\")\n",
606
  "\n",
607
- "# Reset an episode on the remote environment\n",
608
- "resp = requests.post(f\"{HF_SPACE_URL}/reset\", json={\"seed\": 42}).json()\n",
609
- "obs_data = resp[\"observation\"]\n",
610
- "print(f\"\\nRemote reset — max_elongation: {obs_data['max_elongation']:.4f}\")\n",
611
- "print(f\" aspect_ratio: {obs_data['aspect_ratio']:.4f}\")\n",
612
- "print(f\" constraints_satisfied: {obs_data['constraints_satisfied']}\")\n",
613
- "print(f\" budget_remaining: {obs_data['budget_remaining']}\")\n",
614
- "\n",
615
- "# Generate an action plan from the trained model\n",
616
- "remote_obs = StellaratorObservation.model_validate(obs_data)\n",
617
- "prompt = build_prompt(remote_obs)\n",
618
- "inputs = tokenizer(prompt, return_tensors=\"pt\").to(model.device)\n",
619
- "outputs = model.generate(\n",
620
- " **inputs, max_new_tokens=MAX_COMPLETION_LENGTH, temperature=0.7, do_sample=True\n",
621
- ")\n",
622
- "completion = tokenizer.decode(outputs[0][inputs[\"input_ids\"].shape[1] :], skip_special_tokens=True)\n",
623
- "actions = parse_action_plan(completion)\n",
624
- "\n",
625
- "print(f\"\\nTrained model generated {len(actions)} actions for remote env:\")\n",
626
- "for i, action in enumerate(actions[:BUDGET]):\n",
627
- " action_payload = action.model_dump(exclude_none=True)\n",
628
- " step_resp = requests.post(f\"{HF_SPACE_URL}/step\", json={\"action\": action_payload}).json()\n",
629
- " r = step_resp.get(\"reward\", 0)\n",
630
- " done = step_resp.get(\"done\", False)\n",
631
- " step_obs = step_resp[\"observation\"]\n",
632
- " print(\n",
633
- " f\" Step {i + 1}: {action.intent} {action.parameter or ''} \"\n",
634
- " f\"{action.direction or ''} {action.magnitude or ''} \"\n",
635
- " f\"→ reward={r:.3f}, score={step_obs['p1_score']:.4f}\"\n",
636
  " )\n",
637
- " if done:\n",
638
- " print(f\" Episode done. Final score: {step_obs['p1_score']:.4f}\")\n",
639
- " break\n",
640
  "\n",
641
  "print(\"\\nEnvironment is live and accessible for training and evaluation.\")"
642
  ]
 
87
  "cell_type": "markdown",
88
  "id": "8edb47106e1a46a883d545849b8ab81b",
89
  "metadata": {},
90
+ "source": "## 3. Setup Stellarator Environment\n\nInstall the environment package directly from the HF Space repository so training runs locally (no network latency per step). The package also includes the typed `FusionLabClient` and Pydantic models for remote OpenEnv sessions."
91
  },
92
  {
93
  "cell_type": "code",
 
284
  "cell_type": "markdown",
285
  "id": "504fb2a444614c0babb325280ed9130a",
286
  "metadata": {},
287
+ "source": "## 6. Reward Functions\n\nTwo reward signals for GRPO:\n1. **Format reward**: Is the completion a valid JSON action plan?\n2. **Environment reward**: Execute the plan in the stellarator environment and return the cumulative reward. The environment's built-in reward already decomposes feasibility (+3/−3 crossing bonuses, feasibility progress), objective (max elongation improvement), step costs, submit bonuses, and failure penalties — see `server/environment.py:_compute_reward`."
288
  },
289
  {
290
  "cell_type": "code",
 
292
  "id": "59bbdb311c014d738909a11f9e486628",
293
  "metadata": {},
294
  "outputs": [],
295
+ "source": "import traceback\n\n\ndef format_reward_fn(completions: list[str], **kwargs) -> list[float]:\n \"\"\"Reward for producing a valid, parseable action plan.\"\"\"\n rewards = []\n for completion in completions:\n actions = parse_action_plan(completion)\n if len(actions) == 0:\n rewards.append(-1.0)\n elif any(a.intent == \"submit\" for a in actions):\n rewards.append(1.0)\n else:\n rewards.append(0.0)\n return rewards\n\n\ndef environment_reward_fn(\n completions: list[str], seed_idx: list[int] | None = None, **kwargs\n) -> list[float]:\n \"\"\"Execute each action plan in the environment and return cumulative reward.\n\n The environment's _compute_reward already includes:\n - Feasibility crossing bonuses (+3/-3)\n - Infeasible progress: 5.0 * delta_feasibility\n - Feasible improvement: 10.0 * delta_max_elongation\n - Submit improvement bonus: 5.0 * ratio + budget_fraction\n - Step cost (-0.1), failure penalties, recovery bonuses\n \"\"\"\n rewards = []\n seeds = seed_idx if seed_idx is not None else [0] * len(completions)\n for i, completion in enumerate(completions):\n try:\n actions = parse_action_plan(completion)\n if len(actions) == 0:\n rewards.append(-3.0)\n continue\n env = StellaratorEnvironment()\n env.reset(seed=int(seeds[i]) % len(RESET_SEEDS))\n total_reward = 0.0\n for action in actions[:BUDGET]:\n obs = env.step(action)\n total_reward += float(obs.reward) if obs.reward is not None else 0.0\n if obs.done:\n break\n rewards.append(total_reward)\n except Exception:\n traceback.print_exc()\n rewards.append(-3.0)\n return rewards\n\n\n# Test reward functions with a hand-crafted plan\ntest_plan = json.dumps(\n [\n {\n \"intent\": \"run\",\n \"parameter\": \"triangularity_scale\",\n \"direction\": \"increase\",\n \"magnitude\": \"small\",\n },\n {\n \"intent\": \"run\",\n \"parameter\": \"rotational_transform\",\n \"direction\": \"increase\",\n \"magnitude\": \"medium\",\n },\n {\"intent\": \"submit\"},\n ]\n)\nprint(f\"Format reward: {format_reward_fn([test_plan])}\")\nprint(f\"Environment reward: {environment_reward_fn([test_plan], seed_idx=[0])}\")"
296
  },
297
  {
298
  "cell_type": "markdown",
 
310
  "id": "8a65eabff63a45729fe45fb5ade58bdc",
311
  "metadata": {},
312
  "outputs": [],
313
+ "source": "from trl import GRPOConfig, GRPOTrainer\n\nMAX_PROMPT_LENGTH = 768\nMAX_COMPLETION_LENGTH = MAX_SEQ_LENGTH - MAX_PROMPT_LENGTH\n\ntraining_args = GRPOConfig(\n output_dir=\"./grpo_fusion_output\",\n learning_rate=2e-4,\n num_generations=4,\n max_completion_length=MAX_COMPLETION_LENGTH,\n max_prompt_length=MAX_PROMPT_LENGTH,\n per_device_train_batch_size=4,\n gradient_accumulation_steps=1,\n max_steps=60,\n temperature=1.0,\n logging_steps=1,\n save_steps=20,\n bf16=True,\n report_to=\"none\",\n seed=42,\n)\n\ntrainer = GRPOTrainer(\n model=model,\n processing_class=tokenizer,\n reward_funcs=[format_reward_fn, environment_reward_fn],\n args=training_args,\n train_dataset=dataset,\n)\n\nprint(\"Starting GRPO training...\")\ntrain_result = trainer.train()\nprint(f\"Training complete. Total steps: {train_result.global_step}\")"
314
  },
315
  {
316
  "cell_type": "markdown",
 
413
  " total_reward = 0.0\n",
414
  " for action in actions[:BUDGET]:\n",
415
  " obs = env.step(action)\n",
416
+ " r = float(obs.reward) if obs.reward is not None else 0.0\n",
417
  " total_reward += r\n",
418
  " trace.append(\n",
419
  " f\" {action.intent} {action.parameter or ''} {action.direction or ''} {action.magnitude or ''} → reward={r:.3f} score={obs.p1_score:.4f} feasible={obs.constraints_satisfied}\".strip()\n",
 
432
  " spec = random.choice(AVAILABLE_ACTIONS[:24]) # run actions only\n",
433
  " action = StellaratorAction(**spec)\n",
434
  " obs = env.step(action)\n",
435
+ " total_reward += float(obs.reward) if obs.reward is not None else 0.0\n",
436
  " if obs.done:\n",
437
  " return total_reward\n",
438
  " # submit on last step\n",
439
  " obs = env.step(StellaratorAction(intent=\"submit\"))\n",
440
+ " total_reward += float(obs.reward) if obs.reward is not None else 0.0\n",
441
  " return total_reward\n",
442
  "\n",
443
  "\n",
 
474
  "cell_type": "markdown",
475
  "id": "cb1e1581032b452c9409d6c6813c49d1",
476
  "metadata": {},
477
+ "source": "## 10. Connect to Deployed HF Space\n\nDemonstrate connecting to the live environment on Hugging Face Spaces through the typed OpenEnv client and running the trained model against it."
478
  },
479
  {
480
  "cell_type": "code",
 
485
  "source": [
486
  "import requests\n",
487
  "\n",
488
+ "from fusion_lab.client import FusionLabClient\n",
489
  "\n",
490
  "HF_SPACE_URL = \"https://creativeengineer-fusion-design-lab.hf.space\"\n",
491
  "\n",
 
499
  "print(f\"Constraints: {task['constraints']}\")\n",
500
  "print(f\"Budget: {task['budget']}\")\n",
501
  "\n",
502
+ "with FusionLabClient(base_url=HF_SPACE_URL) as env:\n",
503
+ " reset_result = env.reset(seed=42)\n",
504
+ " remote_obs = reset_result.observation\n",
505
+ " print(f\"\\nRemote reset — max_elongation: {remote_obs.max_elongation:.4f}\")\n",
506
+ " print(f\" aspect_ratio: {remote_obs.aspect_ratio:.4f}\")\n",
507
+ " print(f\" constraints_satisfied: {remote_obs.constraints_satisfied}\")\n",
508
+ " print(f\" budget_remaining: {remote_obs.budget_remaining}\")\n",
509
+ "\n",
510
+ " # Generate an action plan from the trained model\n",
511
+ " prompt = build_prompt(remote_obs)\n",
512
+ " inputs = tokenizer(prompt, return_tensors=\"pt\").to(model.device)\n",
513
+ " outputs = model.generate(\n",
514
+ " **inputs, max_new_tokens=MAX_COMPLETION_LENGTH, temperature=0.7, do_sample=True\n",
515
+ " )\n",
516
+ " completion = tokenizer.decode(\n",
517
+ " outputs[0][inputs[\"input_ids\"].shape[1] :], skip_special_tokens=True\n",
 
518
  " )\n",
519
+ " actions = parse_action_plan(completion)\n",
520
+ "\n",
521
+ " print(f\"\\nTrained model generated {len(actions)} actions for remote env:\")\n",
522
+ " for i, action in enumerate(actions[:BUDGET]):\n",
523
+ " result = env.step(action)\n",
524
+ " step_obs = result.observation\n",
525
+ " reward = float(result.reward) if result.reward is not None else 0.0\n",
526
+ " print(\n",
527
+ " f\" Step {i + 1}: {action.intent} {action.parameter or ''} \"\n",
528
+ " f\"{action.direction or ''} {action.magnitude or ''} \"\n",
529
+ " f\"→ reward={reward:.3f}, score={step_obs.p1_score:.4f}\"\n",
530
+ " )\n",
531
+ " if result.done:\n",
532
+ " print(f\" Episode done. Final score: {step_obs.p1_score:.4f}\")\n",
533
+ " break\n",
534
  "\n",
535
  "print(\"\\nEnvironment is live and accessible for training and evaluation.\")"
536
  ]