feat: reward verifier alignment, notebook hardening, model name fix
cdc237b

Training and evaluation notebooks belong here.

This repository treats notebooks and trained-policy runs as supporting evidence for the environment, not the primary product.

Training policy:

  • train on the live low-fidelity environment surface, including the explicit submit action
  • keep the standard training/llm_rollout.py monitor/evaluate workflow on the same live contract as the notebook
  • keep high-fidelity validation in offline tooling such as baselines/high_fidelity_validation.py

Status

  • Northflank notebook artifacts saved
  • repository GRPO notebook saved
  • Colab mirror or public notebook link saved if required by the submission surface
  • tiny low-fi PPO smoke artifact saved
  • fixed-seed untrained baseline artifact saved
  • before/after trained-policy evidence saved

Runnable paths

  • install the training dependencies: uv sync --extra training
  • tiny low-fi PPO smoke run: uv run --extra training python training/ppo_smoke.py
  • generate an LLM-ready prompt payload: uv run python training/llm_rollout.py prompt --seed 0
  • replay an LLM completion or action plan: uv run python training/llm_rollout.py replay --seed 0 --completion-file <path>
  • monitor reward terms, action clamping, and verifier outcomes across seeds: uv run python training/llm_rollout.py monitor --completion-file <path> --seeds 0,1,2
  • generate fresh model completions per seed and save aggregate reward/outcome metrics: uv run python training/llm_rollout.py evaluate --completion-command 'python path/to/model_cli.py' --seeds 0,1,2

Use monitor when you already have a single completion or action plan and want to replay it unchanged across seeds. Use evaluate for before/after policy comparison, because it generates a fresh completion per seed.

Current validation target

  • save one untrained fixed-seed baseline with evaluate
  • run one short GRPO pass on Northflank H100 with the repository notebook
  • rerun the same seeds and compare reward and low-fidelity feasibility before and after training
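The before/after step above amounts to diffing per-seed reward between two evaluate runs. The sketch below assumes a dict-of-dicts layout keyed by seed with a "reward" field; evaluate is only described as saving aggregate reward/outcome metrics, so the real on-disk format may differ, and `reward_deltas` is a hypothetical helper name.

```python
def reward_deltas(before: dict, after: dict) -> dict:
    """Per-seed mean-reward change between two evaluate runs.

    `before` and `after` are assumed to map seed -> {"reward": float};
    the actual layout of evaluate's saved metrics may differ.
    """
    shared = sorted(set(before) & set(after))
    return {s: round(after[s]["reward"] - before[s]["reward"], 6) for s in shared}

# Illustrative numbers only: positive deltas mean the trained checkpoint
# improved mean reward on that fixed seed.
untrained = {0: {"reward": -1.2}, 1: {"reward": -0.9}, 2: {"reward": -1.5}}
trained = {0: {"reward": -0.4}, 1: {"reward": -0.7}, 2: {"reward": -0.6}}
print(reward_deltas(untrained, trained))
```

Restricting the comparison to seeds present in both runs keeps the diff honest if one run was interrupted partway through.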

Shared LLM contract

The prompt/action/replay contract for LLM training lives in:

  • fusion_lab/llm_agent.py

Use that module as the source of truth for:

  • prompt formatting
  • action-plan parsing
  • local rollout replay
  • rollout telemetry structure used by the monitor command

For prompt, monitor, evaluate, and the notebook, the shared helper contract now includes the live submit action. Use offline validation scripts when you explicitly want high-fidelity checks outside the environment loop.

For evaluate, the completion command reads the prompt from stdin and writes a raw completion to stdout. The current seed is exposed as the FUSION_LAB_SEED environment variable so the same command can be used for fixed-seed before/after comparisons of untrained and trained checkpoints.
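A minimal completion command satisfying this contract might look like the sketch below. The completion body is a placeholder: the real action-plan format is defined by fusion_lab/llm_agent.py, and `generate_completion` would normally invoke a model checkpoint rather than format a string.

```python
import os
import sys


def generate_completion(prompt: str, seed: int) -> str:
    """Placeholder policy: a real command would run a model here.

    The returned text is illustrative only; the actual action-plan
    format is defined by fusion_lab/llm_agent.py.
    """
    return f"# completion for seed {seed}, prompt length {len(prompt)}"


def main() -> None:
    # evaluate pipes the prompt in on stdin and exposes the current
    # seed as FUSION_LAB_SEED, so one command serves fixed-seed
    # before/after runs of untrained and trained checkpoints.
    prompt = sys.stdin.read()
    seed = int(os.environ.get("FUSION_LAB_SEED", "0"))
    sys.stdout.write(generate_completion(prompt, seed))


if __name__ == "__main__":
    main()
```

Saved as, say, a hypothetical path/to/model_cli.py, this is the shape of script passed to evaluate via --completion-command.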