CreativeEngineer committed
Commit ba716cf · 1 Parent(s): 16065b1

docs: sync status trackers to verifier state
README.md CHANGED

@@ -22,6 +22,7 @@ Implementation status:
  - docs are aligned to fresh `P1` wiring in this repo
  - shared models, baselines, and server/client entry points now reflect the locked `P1` contract
  - the current environment uses `constellaration` for low-fidelity `run` steps and high-fidelity `submit` evaluation
+ - the remaining runtime work is fixture coverage, manual playtesting, heuristic refresh, and deployment evidence

  ## Execution Status

@@ -35,6 +36,8 @@ Implementation status:
  - [x] Replace the synthetic evaluator with `constellaration`
  - [ ] Add tracked `P1` fixtures under `server/data/p1/`
  - [ ] Run manual playtesting and record the first reward pathology
+ - [ ] Refresh the heuristic baseline for the real verifier path
+ - [ ] Pass the Northflank smoke test on the H100 workspace
  - [ ] Deploy the real environment to HF Space

  ## Known Gaps

@@ -47,7 +50,7 @@ Implementation status:
  Current mode:

  - strategic task choice is already locked
- - the next work is implementation, smoke validation, and manual playtesting
+ - the next work is fixtures, manual playtesting, heuristic refresh, smoke validation, and deployment
  - new planning text should only appear when a real blocker forces a decision change

  ## Planned Repository Layout

@@ -100,15 +103,15 @@ uv sync --extra notebooks

  ## Immediate Next Steps

- 1. Set up the Northflank Jupyter Notebook with PyTorch and attach persistent storage.
- 2. Pass a Northflank smoke test:
+ 1. Add tracked `P1` fixtures under `server/data/p1`.
+ 2. Run manual playtest episodes and record the first real reward pathology, if any.
+ 3. Refresh the heuristic baseline using manual playtest evidence, then save one comparison trace.
+ 4. Pass a Northflank smoke test:
     - import `constellaration`
     - run one rotating-ellipse generation plus one low-fidelity verifier call
     - write an artifact to persistent storage
- 3. Add tracked `P1` fixtures under `server/data/p1`.
- 4. Refresh the heuristic baseline using manual playtest evidence, then save one comparison trace.
- 5. Add the Colab notebook under `training/notebooks`.
- 6. Run manual playtest episodes before heavy training work.
+ 5. Deploy the environment to HF Space.
+ 6. Add the Colab notebook under `training/notebooks`.

  These are implementation steps, not another planning phase.
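Editorial note on the smoke test above: its final step, writing an artifact to persistent storage, can be sketched in a few lines of plain Python. The helper name and filename scheme here are illustrative assumptions, not an existing repo convention.

```python
import json
import time
from pathlib import Path


def write_smoke_artifact(storage_dir: str, payload: dict) -> Path:
    """Persist one smoke-test result so the run leaves durable evidence.

    `storage_dir` would be the mounted persistent volume; the
    `smoke_<timestamp>.json` naming is an assumption for this sketch.
    """
    root = Path(storage_dir)
    root.mkdir(parents=True, exist_ok=True)
    out = root / f"smoke_{int(time.time())}.json"
    out.write_text(json.dumps(payload, indent=2))
    return out
```

A smoke run would call this once after the low-fidelity verifier call, passing the returned score and the parameters that produced it.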
 
TODO.md CHANGED

@@ -36,9 +36,9 @@ Priority source:

  ```mermaid
  flowchart TD
- A["Northflank Smoke Test"] --> C["constellaration Physics Wiring"]
+ A["Northflank Smoke Test"] --> E["Fixture Checks"]
  B["P1 Contract Lock"] --> D["P1 Models + Environment"]
- C --> D
+ C["constellaration Physics Wiring"] --> D
  D --> E["Fixture Checks"]
  E --> F["Manual Playtest"]
  F --> G["Reward V1"]

@@ -121,6 +121,13 @@ flowchart TD
  [AGENTS.md](AGENTS.md),
  [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md)

+ - [ ] Write down whether `Reward V0` survives unchanged
+ Goal:
+ if playtesting does not reveal a real pathology, record that outcome explicitly instead of forcing a `V1`
+ Related:
+ [README.md](README.md),
+ [Plan V2](docs/FUSION_DESIGN_LAB_PLAN_V2.md)
+
  - [x] Decide the non-submit terminal reward policy
  Goal:
  budget exhaustion now yields a smaller end-of-episode reward than `submit`, so non-submitting agents still get terminal feedback without outranking explicit submit behavior

@@ -184,3 +191,4 @@ flowchart TD
  - [ ] Do not add training-first complexity before manual playtesting
  - [ ] Do not describe low-fidelity `run` metrics as equivalent to high-fidelity `submit` results
  - [ ] Do not describe the current baseline reset state as already feasible
+ - [ ] Do not force a `Reward V1` story if `Reward V0` survives manual playtesting
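Editorial note on the flowchart edits above: the corrected dependency edges can be sanity-checked mechanically. This sketch copies the node labels from the mermaid chart and verifies the ordering with a standard-library topological sort.

```python
from graphlib import TopologicalSorter

# Edges as in the mermaid chart: X --> Y means X must finish before Y.
edges = [
    ("Northflank Smoke Test", "Fixture Checks"),
    ("P1 Contract Lock", "P1 Models + Environment"),
    ("constellaration Physics Wiring", "P1 Models + Environment"),
    ("P1 Models + Environment", "Fixture Checks"),
    ("Fixture Checks", "Manual Playtest"),
    ("Manual Playtest", "Reward V1"),
]

# TopologicalSorter takes a mapping of node -> set of predecessors.
deps = {}
for src, dst in edges:
    deps.setdefault(dst, set()).add(src)

order = list(TopologicalSorter(deps).static_order())
```

Every other node in the chart is an ancestor of "Reward V1", so any valid order places it last; that is exactly the sequencing the checklist relies on.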
baselines/README.md CHANGED

@@ -1,5 +1,14 @@
  Random and heuristic baselines will live here.

+ ## Status
+
+ - [x] random baseline exists
+ - [x] heuristic baseline exists
+ - [x] baseline comparison script exists
+ - [x] baseline comparison rerun completed on the real verifier path
+ - [ ] heuristic refreshed after the real-verifier rerun
+ - [ ] presentation-ready comparison trace exported
+
  The first baseline milestone is:

  - one random agent
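Editorial note on the comparison milestone above: the "presentation-ready comparison trace" can be backed by a tiny pure-Python helper. The agent names and the mean/best metrics here are illustrative assumptions, not the repo's actual comparison script.

```python
def comparison_table(results: dict[str, list[float]]) -> str:
    """Render per-agent episode scores as a markdown comparison table.

    `results` maps an agent name (e.g. "random", "heuristic") to its
    episode scores; the mean/best columns are assumed metrics.
    """
    lines = [
        "| agent | episodes | mean score | best score |",
        "|---|---|---|---|",
    ]
    for agent, scores in sorted(results.items()):
        mean = sum(scores) / len(scores)
        lines.append(
            f"| {agent} | {len(scores)} | {mean:.3f} | {max(scores):.3f} |"
        )
    return "\n".join(lines)
```

Feeding it the random and heuristic episode scores yields a markdown table that can be pasted directly into the demo notes.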
demo/README.md CHANGED

@@ -1,5 +1,12 @@
  Demo assets belong here.

+ ## Status
+
+ - [ ] stable episode capture exported
+ - [ ] reward-iteration story note exported
+ - [ ] baseline comparison figure exported
+ - [ ] final 1-minute script drafted
+
  Expected contents:

  - one stable episode capture
docs/FUSION_DELIVERABLES_MAP.md CHANGED

@@ -6,6 +6,16 @@ Northflank is the recommended compute workspace behind those artifacts. HF Space

  Use this map to sequence execution, not to reopen already-locked task choices.

+ ## Current Branch Status
+
+ - [x] `P1` contract is frozen in code
+ - [x] official `constellaration` verifier loop is wired
+ - [x] baseline comparison has been rerun on the real verifier path
+ - [ ] tracked fixtures are checked in
+ - [ ] manual playtest evidence exists
+ - [ ] heuristic baseline has been refreshed for the real verifier path
+ - [ ] HF Space deployment is live
+
  ## Deliverables Tree

  ```mermaid

@@ -90,14 +100,12 @@ flowchart LR

  ## Priority Order

- 1. Bring up the Northflank H100 workspace with persistent storage.
- 2. Pass the Northflank smoke test.
- 3. Prove the fresh local `P1` verifier loop.
- 4. Freeze the environment contract and mark the initial reward as `V0`.
- 5. Run verifier/fixture checks and then manual-playtest the environment.
- 6. Fix obvious reward/pathology issues.
- 7. Make one stable OpenEnv `P1` task work remotely with clear, reproducible rules.
- 8. Get random and heuristic baselines.
- 9. Use the notebook to show traces and comparisons; include training only if it adds signal.
- 10. Record the demo around environment clarity, verifier fidelity, reward shaping, and one stable trajectory.
- 11. Polish the repo only after the artifacts are real.
+ 1. Add tracked fixtures and run fixture sanity checks.
+ 2. Manual-playtest the environment and record the first real pathology, if any.
+ 3. Refresh the heuristic baseline from that evidence.
+ 4. Bring up the Northflank H100 workspace with persistent storage.
+ 5. Pass the Northflank smoke test.
+ 6. Make one stable OpenEnv `P1` task work remotely with clear, reproducible rules.
+ 7. Use the notebook to show traces and comparisons; include training only if it adds signal.
+ 8. Record the demo around environment clarity, verifier fidelity, reward shaping, and one stable trajectory.
+ 9. Polish the repo only after the artifacts are real.
 
 
docs/FUSION_DESIGN_LAB_PLAN_V2.md CHANGED

@@ -4,6 +4,19 @@
  **Track:** Statement 3.1 (World Modeling — Professional Tasks)
  **Status:** Judge-aligned plan with `P1` locked

+ ## 0. Current Branch Status
+
+ - [x] `P1` task family is locked
+ - [x] rotating-ellipse `P1` contract is implemented in code
+ - [x] real `constellaration` verifier wiring is in place
+ - [x] low-fidelity `run` plus high-fidelity `submit` split is documented
+ - [x] post-terminal `step()` guard is in place
+ - [x] baseline comparison has been rerun on the real verifier path
+ - [ ] tracked `P1` fixtures are added
+ - [ ] manual playtest evidence is recorded
+ - [ ] heuristic baseline is refreshed for the real verifier path
+ - [ ] HF Space deployment evidence is recorded
+
  ## 1. Submission Thesis

  We are not primarily submitting "a trained model for fusion."

@@ -154,7 +167,7 @@ Allowed reuse:

  Implementation handoff:

- - the remaining work is now wiring, smoke validation, manual playtesting, baselines, and deployment
+ - the remaining work is now fixture coverage, manual playtesting, heuristic refresh, smoke validation, and deployment
  - do not treat supporting decision notes as a new planning backlog

  ## 8.1 Compute Surfaces

@@ -296,6 +309,10 @@ We should expect at least some of these:

  The reward is only acceptable after we test for those behaviors.

+ Important execution rule:
+
+ - if manual playtesting does not reveal a real pathology, keep `Reward V0` and document that outcome rather than forcing a `Reward V1`
+
  ## 12. Verifier and Reward Fixture Checks

  Before training, we should validate environment wiring with a few fixed fixtures.
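Editorial note on the reward policy this plan references: the locked non-submit terminal rule (budget exhaustion pays out less than an explicit submit at the same score) reduces to a tiny shaping function. The scale constants below are assumptions for illustration, not the repo's tuned values.

```python
def terminal_reward(
    best_score: float,
    submitted: bool,
    submit_scale: float = 1.0,   # full credit for an explicit submit (assumed)
    timeout_scale: float = 0.5,  # discounted credit on budget exhaustion (assumed)
) -> float:
    """End-of-episode reward: an explicit submit gets full credit for the
    best score; budget exhaustion still gets terminal feedback, but scaled
    down so it never outranks a submit at the same score."""
    scale = submit_scale if submitted else timeout_scale
    return scale * best_score
```

Any `timeout_scale` strictly between 0 and `submit_scale` preserves the intended ordering: non-submitting agents get signal without outranking submit behavior.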
docs/FUSION_NEXT_12_HOURS_CHECKLIST.md CHANGED

@@ -13,7 +13,9 @@ Do not expand scope beyond one stable task. Training is supporting evidence, not
  - [x] baselines and API surface have been moved to the `P1` contract
  - [x] add a post-terminal guard in `step()`
  - [x] replace the synthetic evaluator with `constellaration`
+ - [x] re-run baselines on the real verifier path
  - [ ] add tracked fixtures and manual playtest evidence
+ - [ ] refresh the heuristic baseline after the real-verifier rerun

  ## Plan V2 Inheritance

@@ -103,6 +105,7 @@ Transition rule:
  - bad behavior
  - refinement to reward V1
  - improved behavior
+ 8. If no real pathology appears, record that `Reward V0` survived playtesting and move on.

  Exit condition: you can explain why the environment now rewards the intended behavior.
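Editorial note on the checked-off post-terminal guard above: the guard amounts to refusing `step()` calls on a finished episode instead of silently producing rewards. A minimal sketch with illustrative names (the repo's actual environment class differs):

```python
class P1Episode:
    """Toy episode illustrating the post-terminal step() guard.

    Class, method, and action names here are illustrative, not the
    repo's actual API.
    """

    def __init__(self, budget: int = 6):
        self.budget = budget
        self.done = False

    def step(self, action: str) -> str:
        # The guard: once terminal, fail loudly rather than keep scoring.
        if self.done:
            raise RuntimeError("episode already terminated; call reset() first")
        self.budget -= 1
        if action == "submit" or self.budget == 0:
            self.done = True
        return "terminal" if self.done else "running"
```

The important property is that a post-terminal `step()` raises instead of returning another observation, so agents cannot farm reward after `submit` or after budget exhaustion.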
 
docs/PIVOT_P1_ROTATING_ELLIPSE.md CHANGED

@@ -6,6 +6,15 @@

  Use this file as rationale for the pivot, not as a fresh planning queue. Once the pivot is accepted, implementation should follow the SSOT plan docs.

+ ## Current Branch Status
+
+ - [x] pivot accepted
+ - [x] rotating-ellipse `P1` contract is implemented
+ - [x] `constellaration` verifier path is wired
+ - [ ] tracked fixtures are added
+ - [ ] manual playtest evidence is recorded
+ - [ ] heuristic baseline is refreshed for the real verifier path
+
  ## Decision

  Pivot the OpenEnv environment to use the official ConStellaration P1 benchmark with real VMEC physics, scoped to the rotating-ellipse low-dimensional parameter space.

@@ -147,7 +156,6 @@ current_params: {aspect_ratio, elongation, rotational_transform}
  best_params: {aspect_ratio, elongation, rotational_transform}
  initial_score: float
  best_score: float
- current_feasibility: float
  best_feasibility: float
  history: list[str]
  ```

@@ -171,13 +179,17 @@ history: list[str]

  ### Phase 1: Physics Backend (~1 hour)

+ Status: done.
+
  Rewrite `server/physics.py` to wrap:
  - `constellaration.initial_guess.generate_rotating_ellipse` for boundary generation
  - `constellaration.forward_model.forward_model` with low-fi settings for evaluation
- - `constellaration.problems.GeometricalProblem` for official P1 scoring on submit
+ - `constellaration.problems.GeometricalProblem` for official P1 scoring on every evaluation

  ### Phase 2: Environment Contract (~1 hour)

+ Status: done.
+
  Update `server/environment.py`:
  - New observation schema with P1 metrics
  - New action schema for rotating-ellipse perturbations

@@ -188,6 +200,8 @@ Update `fusion_lab/models.py` for new schemas.

  ### Phase 3: Manual Playtest (~30 min)

+ Status: open.
+
  Validate hypothesis: "6 actions is enough."
  - Play 5-10 episodes manually
  - Log: can a human reach feasibility? Improve elongation?

@@ -196,12 +210,16 @@ Validate hypothesis: "6 actions is enough."

  ### Phase 4: Baselines (~30 min)

+ Status: partial. Baselines exist, but the heuristic needs refresh on the real verifier path.
+
  - Random agent
  - Heuristic agent (greedy toward known-good parameter region)
  - Comparison table

  ### Phase 5: Deploy + Evidence (~2 hours)

+ Status: open.
+
  - Update Dockerfile/deps for constellaration
  - `openenv validate` + `openenv push`
  - Colab notebook connecting to live environment
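Editorial note on the post-pivot observation schema shown in this file: after dropping `current_feasibility`, the schema maps directly onto typed models. A dataclass sketch of the shape (the repo's `fusion_lab/models.py` likely uses pydantic, so treat this as shape only):

```python
from dataclasses import dataclass, field


@dataclass
class RotatingEllipseParams:
    """The three rotating-ellipse knobs named in the schema."""
    aspect_ratio: float
    elongation: float
    rotational_transform: float


@dataclass
class P1Observation:
    """Shape of the P1 observation after the schema change above."""
    current_params: RotatingEllipseParams
    best_params: RotatingEllipseParams
    initial_score: float
    best_score: float
    best_feasibility: float
    history: list[str] = field(default_factory=list)
```

Keeping only `best_feasibility` (and not a per-step feasibility) matches the diff: the agent sees its best verified design, not a running feasibility readout.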
server/data/README.md CHANGED

@@ -1,3 +1,7 @@
  Baseline VMEC inputs and related static assets belong here.

  Do not commit generated solver outputs or large transient artifacts.
+
+ ## Status
+
+ - [ ] tracked `P1` fixture assets added under `server/data/p1/`
server/data/p1/README.md CHANGED

@@ -10,4 +10,11 @@ Intended contents:

  These fixtures are for verifier and reward sanity checks.

+ ## Status
+
+ - [ ] known-good or near-winning fixture added
+ - [ ] near-boundary fixture added
+ - [ ] clearly infeasible fixture added
+ - [ ] fixture sanity note written
+
  Do not copy the old `ai-sci-feasible-designs` harness here. Reuse only the specific JSON artifacts needed for the fresh `P1` environment.
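Editorial note on the three fixtures listed above: once they exist, the "fixture sanity note" can be backed by an automated ordering check. This sketch assumes hypothetical fixture filenames (`good.json`, `boundary.json`, `infeasible.json`) and takes the evaluator as a parameter rather than importing the real verifier.

```python
import json
from pathlib import Path


def check_fixture_ordering(fixture_dir: str, evaluate) -> dict:
    """Sanity check: the verifier's scores across the three fixtures
    should be monotone: good >= boundary >= infeasible.

    `evaluate` is whatever callable maps fixture params to a score
    (the real verifier in the repo, a stub in tests).
    """
    scores = {}
    for name in ("good", "boundary", "infeasible"):
        params = json.loads((Path(fixture_dir) / f"{name}.json").read_text())
        scores[name] = evaluate(params)
    assert scores["good"] >= scores["boundary"] >= scores["infeasible"], scores
    return scores
```

If the real verifier ever inverts this ordering, that is exactly the kind of wiring bug these fixtures exist to catch before training.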
training/README.md CHANGED

@@ -1,3 +1,9 @@
  Training and evaluation notebooks belong here.

  This repository treats notebooks as supporting evidence for the environment, not the primary product.
+
+ ## Status
+
+ - [ ] Northflank notebook artifacts saved
+ - [ ] Colab notebook saved
+ - [ ] training evidence included only if it is persuasive
training/notebooks/README.md CHANGED

@@ -12,6 +12,12 @@ Recommended split:
  - Northflank notebook: main compute workspace on the team H100
  - Colab notebook: thin public artifact required by the hackathon

+ ## Status
+
+ - [ ] Northflank smoke notebook note saved
+ - [ ] manual-playtest notebook or trace notebook saved
+ - [ ] thin public Colab notebook saved
+
  Operational defaults:

  - use the same Python dependency set as the repo runtime