permanence / docs /METHODS.md

chane335

PERMANENCE training: 4-stage SFT -> gate -> GRPO -> eval pipeline

8aa902a verified about 1 month ago

PERMANENCE — Training Methodology

This document explains the methodological choices behind the training pipeline and the reasons for them. It is intended for reviewers who want to understand the research decisions and for practitioners who want to port the recipe to a different environment.

1. Why not pure supervised fine-tuning

The obvious first try is to generate a dataset of (prompt, gold_completion) pairs and do SFT. We rejected that approach for three reasons:

Calibration cannot be supervised from demonstrations alone. The reward term level_accuracy × (1 − |confidence − level_accuracy|) scores the confidence the model emits. A demonstration trace fixes a single confidence value per example, which is not the same as teaching the model how its confidence should vary across examples. RL optimises the term over the policy's own sampled outputs, so the confidence is shaped by how accurate the policy actually is rather than by a fixed label.
Destructive-outcome scenarios need exploration. In the variants where the normally safe action is disabled, the policy has to discover that the destructive action is now the correct one. A supervised dataset that demonstrates the destructive action would just teach a surface shortcut such as "when prompt contains 'URGENT' → do the destructive action", which the policy would overfit to. RL allows the policy to reach the same conclusion by trying both actions and comparing their rewards.
Option preservation is a trajectory-level signal. Whether an episode's early actions closed off downstream options can only be scored at episode end. GRPO's group-relative advantage over complete rollouts is the natural fit.
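
To make the calibration term from the first point concrete, here is a minimal sketch of level_accuracy × (1 − |confidence − level_accuracy|). The function name is hypothetical, and the fractional level_accuracy value in the last two calls is an assumption chosen for illustration, not a value taken from the rubric.

```python
def calibration_term(level_accuracy: float, confidence: float) -> float:
    """level_accuracy * (1 - |confidence - level_accuracy|), both inputs in [0, 1]."""
    return level_accuracy * (1.0 - abs(confidence - level_accuracy))


# A fully correct prediction earns more the closer its confidence is to 1.
high_conf_correct = calibration_term(1.0, 0.9)
low_conf_correct = calibration_term(1.0, 0.5)

# With zero accuracy the term is zero, whatever confidence is stated.
wrong = calibration_term(0.0, 0.9)

# With partial credit (assumed 0.5 here), the best confidence is 0.5, not 1.0.
partial_matched = calibration_term(0.5, 0.5)
partial_overconfident = calibration_term(0.5, 1.0)

print(high_conf_correct, low_conf_correct, wrong)
print(partial_matched, partial_overconfident)
```

The last pair shows why a single demonstrated confidence value is not enough: the confidence that maximises the term depends on how accurate the prediction is.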

We do use SFT for warmup — see §2 — but only to teach the output format and a bias toward producing well-formed R-level predictions before RL optimises the policy.

2. SFT warmup: traces generated by the live environment

The warmup dataset is 78 traces spanning R1–R5. The traces are generated by stepping the live environment at trace-creation time:

env = PermanenceEnv(config={"force_task": task_id})  # pin the episode to one task
obs, info = env.reset(seed=seed)
world = env._current_world_state                 # world state the env actually generated
action = ACTION_REGISTRY[action_id]
resolved_r = action.r_level_fn(world, params)    # source of truth
completion = synthesise_completion(resolved_r, ...)

This matters because the env's scenario generator is stochastic with respect to pre-existing backups, snapshots, and clone preservation. A fixed "seed X → backup present" assumption would break silently across processes with different PYTHONHASHSEED. Resolving the R-level from the live env every time the trace is regenerated eliminates this class of bug.
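
The PYTHONHASHSEED effect can be reproduced in isolation. The sketch below only demonstrates the general Python behaviour (string hashes change with the hash seed, so anything that depends on them, such as set iteration order, can differ between processes). How the scenario generator depends on hashing is not shown in this document, so the link to the env is an assumption; the helper name is hypothetical.

```python
import os
import subprocess
import sys


def str_hash_in_fresh_process(text: str, hash_seed: str) -> int:
    """Return hash(text) as computed by a new interpreter started with PYTHONHASHSEED=hash_seed."""
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


same_a = str_hash_in_fresh_process("backup", "1")
same_b = str_hash_in_fresh_process("backup", "1")
other = str_hash_in_fresh_process("backup", "2")

print("same hash seed, same hash:", same_a == same_b)
print("different hash seed, same hash:", same_a == other)
```

Pinning PYTHONHASHSEED would be one way to remove this source of variation; resolving the R-level from the live env, as described above, sidesteps it without requiring every process to share the setting.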

Distribution of the 78 traces: R1 = 22, R2 = 23, R3 = 3, R4 = 7, R5 = 23. The under-representation of R3 and R4 is acknowledged in the README's "Honest limits" section; it reflects the scenario generator's default distribution rather than a hidden preference.

3. Format-coverage gate

Between SFT and GRPO we run a gate: the model generates a completion for each of 20 held-out prompts, and the gate checks that both the <action/> and <reversibility/> tags are present in at least 80 % of completions.

The gate exists because we saw two early pipeline failures in which SFT converged to a low loss but the model emitted malformed tags at generation time (a collision with the instruction-tuning prior). Running the full GRPO stage on a malformed policy would burn ~60 minutes of GPU time for no useful signal. The gate catches this in ~1 minute.
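
A gate of this kind can be sketched in a few lines. The regular expressions and function names below are hypothetical; in particular, accepting either a self-closing tag or an opening tag with attributes is an assumption about the tag format, which this document does not spell out.

```python
import re

# Assumed tag shapes: <action/>, <action ...> or <action .../> (likewise for reversibility).
ACTION_TAG = re.compile(r"<action\b[^>]*>")
REVERSIBILITY_TAG = re.compile(r"<reversibility\b[^>]*>")


def format_coverage(completions: list[str]) -> float:
    """Fraction of completions that contain both required tags."""
    if not completions:
        return 0.0
    well_formed = sum(
        1
        for text in completions
        if ACTION_TAG.search(text) and REVERSIBILITY_TAG.search(text)
    )
    return well_formed / len(completions)


def gate_passes(completions: list[str], threshold: float = 0.80) -> bool:
    """True when coverage reaches the threshold, so GRPO may start."""
    return format_coverage(completions) >= threshold


good = '<action id="fs_ls"/><reversibility level="R1"/>'
bad = "I think the action should be fs_ls because it is safe."
print(gate_passes([good] * 16 + [bad] * 4))  # 16 of 20 well-formed
print(gate_passes([good] * 15 + [bad] * 5))  # 15 of 20 well-formed
```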

4. GRPO configuration

We use TRL's GRPOTrainer under Unsloth 4-bit quantisation with LoRA rank 16. Settings worth explaining:

| Parameter | Value | Reason |
|---|---|---|
| `group_size` | 4 | Per-prompt rollout diversity; enough for the relative-advantage calculation to have non-zero variance on most prompts |
| `num_iterations` (μ) | 2 | Two inner PPO updates per generation batch. Trades a small amount of off-policy drift for faster convergence |
| `beta` (KL coefficient) | 0.04 | The TRL default. Higher β-values constrain the policy from drifting far from the SFT reference, which prevents a late-training "forgetting" failure mode where the policy loses previously-correct predictions as the curriculum phases in harder tasks |
| `temperature` | 0.85 | High enough that rollouts within a group differ meaningfully, so the group-relative advantage has a useful gradient |
| `total_episodes` | 300 prompts | 300 × 4 = 1 200 rollouts on a T4 in ~70 min |
| `max_completion_length` | 280 | Our completions are three short tags; longer budgets invite length-drift without improving signal |
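
The role of `group_size` and `temperature` is easier to see with the group-relative advantage written out. The sketch below uses one common form (reward minus group mean, divided by group standard deviation plus a small epsilon); it is an illustration, not TRL's implementation, and the reward values are invented.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalise each rollout's reward against the other rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts with different rewards: the advantages carry a learning signal.
diverse = group_relative_advantages([0.2, 0.8, 0.5, 0.5])

# Four rollouts with identical rewards: every advantage is zero, so the
# prompt contributes no gradient.
identical = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(diverse)
print(identical)
```

A group whose rollouts all score the same contributes nothing, which is why the settings above aim for rollouts that differ within each group.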

4.1 On reward shaping

We deliberately do not shape the environment reward beyond a dynamic weighting that phases the format reward out between episodes 60 and 150. Every other signal the policy sees during GRPO comes from the same four-component rubric it will be evaluated on.
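
As an illustration of the dynamic weighting, the sketch below phases a format-reward weight out between episodes 60 and 150. The linear shape, the starting weight of 1.0, and the function name are all assumptions; the document states only the two episode boundaries.

```python
def format_weight(episode: int, start: int = 60, end: int = 150) -> float:
    """Weight on the format reward: full before `start`, zero from `end` on.

    The linear ramp in between is an assumed shape, not the pipeline's confirmed schedule.
    """
    if episode <= start:
        return 1.0
    if episode >= end:
        return 0.0
    return (end - episode) / (end - start)


for episode in (0, 60, 105, 150, 299):
    print(episode, format_weight(episode))
```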

We considered an "unlikeliness" shaping term (reward rare correct solutions more) but removed it after observing that the technique is designed for binary-verifier tasks like theorem proving. In a continuous-reward classification task like ours, where partial credit means the top-ranked reward sample is usually the correct one, the shaping penalises correctness. The clearest diagnostic was a single metric from a pilot run:

db_snapshot (actual R-level R2):
  predicted R1 → avg shaped reward 0.773
  predicted R2 → avg shaped reward 0.751

The shaping inverted the gradient. Disabling it restored the expected ordering (correct R2 > incorrect R1), which we verified by a quick sanity check over 4 sample rollouts before committing to the change. The general principle — match the training signal to the evaluation signal, don't add gradient pressure you will not measure — is the methodological guidance we ship here.

4.2 Length monitor

Independently of the reward architecture, the pipeline tracks the rolling-window mean completion length. If it exceeds 1 000 characters for three consecutive windows, the callback aborts training with a clean error. This caught two early failure modes where the policy drifted into verbose explanation blocks (+3 × completion length, −50 % throughput) that are penalised by the format rubric but not enough to outweigh the GRPO advantage from the occasional correct solution in the long tail. The monitor aborts those runs cleanly instead of letting them burn the full GPU budget.
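
A monitor with this behaviour can be sketched independently of any trainer. The class and exception names are hypothetical, and treating each call as one window is an assumption; in the pipeline the check runs inside a training callback, which is not reproduced here.

```python
class LengthDriftError(RuntimeError):
    """Raised when completions stay too long for too many consecutive windows."""


class LengthMonitor:
    def __init__(self, max_mean_chars: float = 1000.0, patience: int = 3) -> None:
        self.max_mean_chars = max_mean_chars
        self.patience = patience
        self.consecutive_breaches = 0

    def observe_window(self, completions: list[str]) -> None:
        """Record one window of completions; raise after `patience` breaches in a row."""
        if not completions:
            return
        mean_chars = sum(len(text) for text in completions) / len(completions)
        if mean_chars > self.max_mean_chars:
            self.consecutive_breaches += 1
        else:
            self.consecutive_breaches = 0  # a single healthy window resets the count
        if self.consecutive_breaches >= self.patience:
            raise LengthDriftError(
                f"mean completion length {mean_chars:.0f} chars exceeded "
                f"{self.max_mean_chars:.0f} for {self.patience} consecutive windows"
            )


monitor = LengthMonitor()
verbose_window = ["x" * 1200] * 4
try:
    for _ in range(3):
        monitor.observe_window(verbose_window)
except LengthDriftError as err:
    print("aborting:", err)
```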

5. Curriculum

The task sampler follows a three-phase curriculum:

| Episodes | Composition |
|---|---|
| 0 – 49 | Standard tasks only. The policy establishes a baseline on the familiar distribution. |
| 50 – 149 | 50 % destructive-outcome variants. The policy is exposed to the tasks where the normally-safe action is unavailable. |
| 150 – 299 | 70 % destructive-outcome variants. The policy is pushed to solve the hard distribution. |

Starting with destructive-only scenarios from episode 0 produces a cold-start problem: the policy fails every rollout, the group-relative advantage is zero, and GRPO cannot learn. Phasing them in after the warmup baseline is established avoids the cold-start without sacrificing the final capability.
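
The three phases can be expressed as a small schedule function. The names below are hypothetical, and reading the percentages as a per-episode sampling probability (rather than a fixed quota per phase) is an assumption.

```python
import random


def destructive_fraction(episode: int) -> float:
    """Share of destructive-outcome variants for a given episode index."""
    if episode < 50:
        return 0.0
    if episode < 150:
        return 0.5
    return 0.7


def sample_is_destructive(episode: int, rng: random.Random) -> bool:
    """Decide whether this episode draws a destructive-outcome variant."""
    return rng.random() < destructive_fraction(episode)


rng = random.Random(0)
for phase_start, phase_end in ((0, 50), (50, 150), (150, 300)):
    draws = [sample_is_destructive(ep, rng) for ep in range(phase_start, phase_end)]
    print(f"episodes {phase_start}-{phase_end - 1}: {sum(draws)} of {len(draws)} destructive")
```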

6. Evaluation protocol

The held-out evaluation runs on seeds that are disjoint from both the training distribution and the warmup trace seeds. Three policies are compared on identical seeds:

Scripted baseline. A regex-driven heuristic that picks a safe read-only action (fs_ls, db_select, git_log) if one is available in the prompt, else draft_internal_memo. No model inference. Establishes the floor.
Supervised-warmup only. The SFT adapter loaded standalone. Measures what the warmup alone achieves.
RL-trained. The final GRPO adapter. Measures the uplift from the RL stage.
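
For reference, a heuristic in the spirit of the scripted baseline might look like the sketch below. The prompt format, the preference order among the three read-only actions, and the function name are assumptions; only the action identifiers and the fallback come from the description above.

```python
import re

SAFE_READ_ONLY = ("fs_ls", "db_select", "git_log")
FALLBACK = "draft_internal_memo"


def scripted_baseline(prompt: str) -> str:
    """Pick the first safe read-only action named in the prompt, else the fallback."""
    for action_id in SAFE_READ_ONLY:
        # Word boundaries stop fs_ls from matching inside a longer identifier.
        if re.search(rf"\b{re.escape(action_id)}\b", prompt):
            return action_id
    return FALLBACK


print(scripted_baseline("Available actions: fs_rm, git_log, git_push_force"))
print(scripted_baseline("Available actions: db_drop_table, fs_rm"))
```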

The eval has two tracks:

Standard track: 24 scenarios across the four primary tasks, each sampled from the standard (non-destructive-only) distribution.
Destructive-only track: 12 scenarios across the four destructive-outcome variants, with seeds pre-verified to resolve to R5.

All three policies see the same prompts and the same seeds. The reported numbers come from the standard track unless otherwise noted; the destructive-only track's role is to populate the R5 row of the confusion matrix so R5 recall is actually measured.
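
The point about R5 recall can be illustrated with a small helper. The function name is hypothetical and the (actual, predicted) pairs are invented toy values, not results from this project.

```python
from typing import Optional


def recall_for_level(pairs: list[tuple[str, str]], level: str) -> Optional[float]:
    """Recall for one R-level from (actual, predicted) pairs; None if the level never occurs."""
    predictions = [predicted for actual, predicted in pairs if actual == level]
    if not predictions:
        return None  # the confusion-matrix row is empty, so recall is undefined
    return sum(1 for predicted in predictions if predicted == level) / len(predictions)


# Toy data only: a track with no R5 cases leaves R5 recall undefined.
no_r5_track = [("R1", "R1"), ("R2", "R1"), ("R3", "R3")]
print(recall_for_level(no_r5_track, "R5"))

# Toy data only: 12 scenarios that all resolve to R5 make the row measurable.
r5_track = [("R5", "R5")] * 9 + [("R5", "R4")] * 3
print(recall_for_level(r5_track, "R5"))
```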

7. Reproducibility

Every choice that affects the final numbers is pinned:

pyproject.toml pins Python dependencies.
training/config.yaml pins hyperparameters with the values we ran.
training/generate_warmup_traces.py regenerates the 78 traces deterministically from the env (given a fixed scenario generator; see §2 on cross-process caveats).
tests/ catches regressions in both the env and the training glue code before they reach the GPU.
tools/validate_submission.py runs 94 compliance checks (OpenEnv API shape, file presence, endpoint availability, package metadata) and passes clean.

The Colab quickstart (notebooks/train_grpo_colab.ipynb) lets a reviewer re-run the full pipeline on a T4 in ~80 minutes, or pull the pre-trained adapter from the artifacts dataset in seconds.