feat: reward verifier alignment, notebook hardening, model name fix
cdc237b

Training and evaluation notebooks belong here.

This repository treats notebooks and trained-policy runs as supporting evidence for the environment, not the primary product.

Training policy:

  • train on the live low-fidelity environment surface, including the explicit submit action
  • keep the standard training/llm_rollout.py monitor/evaluate workflow on the same live contract as the notebook
  • keep high-fidelity validation in offline tooling such as baselines/high_fidelity_validation.py

Status

  • Northflank notebook artifacts saved
  • repository GRPO notebook saved
  • Colab mirror or public notebook link saved if required by the submission surface
  • tiny low-fi PPO smoke artifact saved
  • fixed-seed untrained baseline artifact saved
  • before/after trained-policy evidence saved

Runnable paths

  • install the training dependencies: uv sync --extra training
  • tiny low-fi PPO smoke run: uv run --extra training python training/ppo_smoke.py
  • generate an LLM-ready prompt payload: uv run python training/llm_rollout.py prompt --seed 0
  • replay an LLM completion or action plan: uv run python training/llm_rollout.py replay --seed 0 --completion-file <path>
  • monitor reward terms, action clamping, and verifier outcomes across seeds: uv run python training/llm_rollout.py monitor --completion-file <path> --seeds 0,1,2
  • generate fresh model completions per seed and save aggregate reward/outcome metrics: uv run python training/llm_rollout.py evaluate --completion-command 'python path/to/model_cli.py' --seeds 0,1,2

Use monitor when you already have a single completion or action plan and want to replay it unchanged across seeds. Use evaluate for before/after policy comparison, because it generates a fresh completion per seed.

Current validation target

  • save one untrained fixed-seed baseline with evaluate
  • run one short GRPO pass on Northflank H100 with the repository notebook
  • rerun the same seeds and compare reward and low-fidelity feasibility before and after training
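The before/after step above amounts to diffing per-seed reward between two evaluate runs. The sketch below assumes a dict-of-dicts layout keyed by seed with a "reward" field; evaluate is only described as saving aggregate reward/outcome metrics, so the real on-disk format may differ, and `reward_deltas` is a hypothetical helper name.

```python
def reward_deltas(before: dict, after: dict) -> dict:
    """Per-seed mean-reward change between two evaluate runs.

    `before` and `after` are assumed to map seed -> {"reward": float};
    the actual layout of evaluate's saved metrics may differ.
    """
    shared = sorted(set(before) & set(after))
    return {s: round(after[s]["reward"] - before[s]["reward"], 6) for s in shared}

# Illustrative numbers only: positive deltas mean the trained checkpoint
# improved mean reward on that fixed seed.
untrained = {0: {"reward": -1.2}, 1: {"reward": -0.9}, 2: {"reward": -1.5}}
trained = {0: {"reward": -0.4}, 1: {"reward": -0.7}, 2: {"reward": -0.6}}
print(reward_deltas(untrained, trained))
```

Restricting the comparison to seeds present in both runs keeps the diff honest if one run was interrupted partway through.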

Shared LLM contract

The prompt/action/replay contract for LLM training lives in:

  • fusion_lab/llm_agent.py

Use that module as the source of truth for:

  • prompt formatting
  • action-plan parsing
  • local rollout replay
  • rollout telemetry structure used by the monitor command

For prompt, monitor, evaluate, and the notebook, the shared helper contract now includes the live submit action. Use offline validation scripts when you explicitly want high-fidelity checks outside the environment loop.

For evaluate, the completion command reads the prompt from stdin and writes a raw completion to stdout. The current seed is exposed as the FUSION_LAB_SEED environment variable so the same command can be used for fixed-seed before/after comparisons of untrained and trained checkpoints.
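A minimal completion command satisfying this contract might look like the sketch below. The completion body is a placeholder: the real action-plan format is defined by fusion_lab/llm_agent.py, and `generate_completion` would normally invoke a model checkpoint rather than format a string.

```python
import os
import sys


def generate_completion(prompt: str, seed: int) -> str:
    """Placeholder policy: a real command would run a model here.

    The returned text is illustrative only; the actual action-plan
    format is defined by fusion_lab/llm_agent.py.
    """
    return f"# completion for seed {seed}, prompt length {len(prompt)}"


def main() -> None:
    # evaluate pipes the prompt in on stdin and exposes the current
    # seed as FUSION_LAB_SEED, so one command serves fixed-seed
    # before/after runs of untrained and trained checkpoints.
    prompt = sys.stdin.read()
    seed = int(os.environ.get("FUSION_LAB_SEED", "0"))
    sys.stdout.write(generate_completion(prompt, seed))


if __name__ == "__main__":
    main()
```

Saved as, say, a hypothetical path/to/model_cli.py, this is the shape of script passed to evaluate via --completion-command.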