PERMANENCE: Training Methodology
This document explains the methodological choices behind the training pipeline and why they are made. It is intended for reviewers who want to understand the research decisions, and for practitioners who want to port the recipe to a different env.
1. Why not pure supervised fine-tuning
The obvious first try is to generate a dataset of
(prompt, gold_completion) pairs and do SFT. We rejected that
approach for three reasons:
- Calibration cannot be supervised from demonstrations alone. The reward term `level_accuracy × (1 − |confidence − level_accuracy|)` scores the confidence the model emits. Demonstration traces force a single confidence value per example, which is not the same as teaching the model how its confidence should vary across examples. RL optimises this distributionally.
- Destructive-outcome scenarios need exploration. In the variants where the normally-safe action is disabled, the policy has to discover that the destructive action is now the correct one. A supervised dataset that demonstrates the destructive action would just teach "when prompt contains 'URGENT' → do the destructive action", which the policy would over-fit. RL allows the policy to reach the same conclusion by trying both.
- Option preservation is a trajectory-level signal. Whether an episode's early actions closed off downstream options can only be scored at episode end. GRPO's group-relative advantage over complete rollouts is the natural fit.
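To make the calibration term concrete, here is a minimal sketch that evaluates the expression quoted above for a few inputs. The function name `calibration_term` is ours for illustration and is not necessarily what the repository calls it; only the formula itself comes from the text.

```python
def calibration_term(level_accuracy: float, confidence: float) -> float:
    """Evaluate level_accuracy * (1 - |confidence - level_accuracy|).

    Both arguments are assumed to lie in [0, 1]. The name of this helper
    is hypothetical; only the formula is taken from the document.
    """
    return level_accuracy * (1.0 - abs(confidence - level_accuracy))


if __name__ == "__main__":
    # Same partial-credit accuracy, different stated confidence: the term is
    # largest when the confidence matches the accuracy.
    for confidence in (0.1, 0.5, 0.9):
        print(confidence, calibration_term(0.5, confidence))
```

Because the term depends on how far the stated confidence sits from the achieved accuracy, a single demonstrated confidence per example cannot express it; the score only becomes informative across many sampled completions.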
We do use SFT for warmup (see §2), but only to teach the output format and a bias toward producing well-formed R-level predictions before RL optimises the policy.
2. SFT warmup: traces generated by the live environment
The warmup dataset is 78 traces spanning R1–R5. The traces are generated by stepping the live environment at trace-creation time:
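```python
# PermanenceEnv and ACTION_REGISTRY are provided by the environment code;
# task_id, seed, action_id and params are set per trace.
env = PermanenceEnv(config={"force_task": task_id})
obs, info = env.reset(seed=seed)
world = env._current_world_state               # world state sampled for this seed
action = ACTION_REGISTRY[action_id]
resolved_r = action.r_level_fn(world, params)  # source of truth
completion = synthesise_completion(resolved_r, ...)
```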
This matters because the env's scenario generator is stochastic
with respect to pre-existing backups, snapshots, and clone
preservation. A fixed "seed X → backup present" assumption would
break silently across processes with different PYTHONHASHSEED.
Resolving the R-level from the live env every time the trace is
regenerated eliminates this class of bug.
Distribution of the 78 traces: R1 = 22, R2 = 23, R3 = 3, R4 = 7, R5 = 23. The underweight on R3 and R4 is acknowledged in the README's "Honest limits" section; it reflects the scenario generator's default distribution rather than a hidden preference.
3. Format-coverage gate
Between SFT and GRPO we run a gate: 20 held-out prompts, model
generates a completion for each, the gate checks that both
<action/> and <reversibility/> tags are present on at least
80 % of completions.
The gate exists because we saw two early pipeline failures in which SFT converged to low loss but emitted malformed tags at generation time (collision with the instruction-tuning prior). Running the full GRPO stage on a malformed policy would burn ~60 minutes of GPU time for no useful signal. The gate catches this in ~1 minute.
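The gate logic can be sketched in a few lines. This is an illustration under assumptions, not the repository's implementation: the exact tag syntax the real gate accepts is not specified here, so the regular expressions below accept both self-closing and opening forms, and the helper names `format_coverage` and `passes_gate` are hypothetical.

```python
import re
from typing import Iterable

# Assumption: a tag counts as present if an opening or self-closing form
# appears anywhere in the completion, with or without attributes.
ACTION_TAG = re.compile(r"<action\b[^>]*>")
REVERSIBILITY_TAG = re.compile(r"<reversibility\b[^>]*>")


def format_coverage(completions: Iterable[str]) -> float:
    """Fraction of completions that contain both required tags."""
    completions = list(completions)
    if not completions:
        return 0.0
    well_formed = sum(
        1
        for text in completions
        if ACTION_TAG.search(text) and REVERSIBILITY_TAG.search(text)
    )
    return well_formed / len(completions)


def passes_gate(completions: Iterable[str], threshold: float = 0.80) -> bool:
    """True when coverage reaches the 80 % threshold described above."""
    return format_coverage(completions) >= threshold
```

In use, the 20 held-out completions would be passed to `passes_gate`, and the GRPO stage would only start when it returns `True`.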
4. GRPO configuration
We use TRL's GRPOTrainer under Unsloth 4-bit quantisation with
LoRA rank 16. Settings worth explaining:
| Parameter | Value | Reason |
|---|---|---|
| `group_size` | 4 | Per-prompt rollout diversity; enough for the relative-advantage calculation to have non-zero variance on most prompts |
| `num_iterations` (μ) | 2 | Two inner PPO updates per generation batch. Trades a small amount of off-policy drift for faster convergence |
| `beta` (KL coefficient) | 0.04 | The TRL default. Higher β values constrain the policy from drifting far from the SFT reference, which prevents a late-training "forgetting" failure mode where the policy loses previously-correct predictions as the curriculum phases in harder tasks |
| `temperature` | 0.85 | High enough that rollouts within a group differ meaningfully, so the group-relative advantage has a useful gradient |
| `total_episodes` | 300 prompts | 300 × 4 = 1 200 rollouts on a T4 in ~70 min |
| `max_completion_length` | 280 | Our completions are three short tags; longer budgets invite length-drift without improving signal |
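The `group_size` and `temperature` rows both rest on the group-relative advantage having non-zero variance. The sketch below shows the usual GRPO-style normalisation for one group of four rewards. It is a simplified illustration, not TRL's code: the population standard deviation and the `eps` value are our assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-4
) -> List[float]:
    """Normalise each reward against the mean and spread of its own group.

    Assumption: population standard deviation plus a small eps in the
    denominator; the trainer's exact normalisation may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four rollouts that disagree: the best one gets a positive advantage.
    print(group_relative_advantages([0.2, 0.9, 0.4, 0.5]))
    # Four identical rewards: every advantage is zero, so there is no gradient.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

The second call is the degenerate case the sampling temperature is meant to avoid: when every rollout in a group earns the same reward, the group contributes nothing to the update.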
4.1 On reward shaping
We deliberately do not shape the environmental reward beyond a dynamic weighting that phases the format reward out between episodes 60 and 150. Every other signal the policy sees during GRPO is the same four-component rubric it will be evaluated on.
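One way to picture the dynamic weighting is a schedule that holds the format weight at its full value until episode 60 and reaches zero at episode 150. The linear shape and the function name `format_weight` are assumptions made for this sketch; the text above only fixes the two endpoints.

```python
def format_weight(episode: int, start: int = 60, end: int = 150) -> float:
    """Weight applied to the format reward at a given episode.

    Assumption: linear decay from 1.0 at `start` to 0.0 at `end`.
    Only the two episode boundaries come from the document.
    """
    if episode <= start:
        return 1.0
    if episode >= end:
        return 0.0
    return 1.0 - (episode - start) / (end - start)


if __name__ == "__main__":
    for episode in (0, 60, 105, 150, 299):
        print(episode, format_weight(episode))
```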
We considered an "unlikeliness" shaping term (reward rare correct solutions more) but removed it after observing that the technique is designed for binary-verifier tasks like theorem proving. In a continuous-reward classification task like ours, where partial credit means the top-ranked reward sample is usually the correct one, the shaping penalises correctness. The clearest diagnostic was a single metric from a pilot run:
```
db_snapshot (actual R-level R2):
  predicted R1 → avg shaped reward 0.773
  predicted R2 → avg shaped reward 0.751
```
The shaping inverted the gradient. Disabling it restored the expected ordering (correct R2 > incorrect R1), which we verified by a quick sanity check over 4 sample rollouts before committing to the change. The general principle is the methodological guidance we ship here: match the training signal to the evaluation signal, and don't add gradient pressure you will not measure.
4.2 Length monitor
Independently of the reward architecture, the pipeline tracks the rolling-window mean completion length. If it exceeds 1 000 characters for three consecutive windows, the callback aborts training with a clean error. This caught two early failure modes where the policy drifted into verbose explanation blocks (3× completion length, −50 % throughput) that are penalised by the format rubric but not enough to outweigh the GRPO advantage from the occasional correct solution in the long tail. The monitor aborts those runs cleanly instead of letting them burn the full GPU budget.
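A framework-independent sketch of such a monitor follows. It is not the pipeline's actual callback: the class names, the window size, and the use of non-overlapping windows are assumptions, while the 1 000-character threshold and the three-window patience come from the description above.

```python
from typing import List


class LengthDriftError(RuntimeError):
    """Raised when completions stay too long for several windows in a row."""


class LengthMonitor:
    """Abort when the mean completion length stays above a threshold.

    Assumptions: windows are non-overlapping and hold `window_size`
    completions each; the real callback may define its windows differently.
    """

    def __init__(self, window_size: int = 32,
                 max_mean_chars: int = 1000, patience: int = 3) -> None:
        self.window_size = window_size
        self.max_mean_chars = max_mean_chars
        self.patience = patience
        self._lengths: List[int] = []
        self._strikes = 0

    def update(self, completion: str) -> None:
        """Record one completion; raise once `patience` windows exceed the limit."""
        self._lengths.append(len(completion))
        if len(self._lengths) < self.window_size:
            return
        window_mean = sum(self._lengths) / len(self._lengths)
        self._lengths.clear()
        # A window under the limit resets the count of consecutive violations.
        self._strikes = self._strikes + 1 if window_mean > self.max_mean_chars else 0
        if self._strikes >= self.patience:
            raise LengthDriftError(
                f"mean completion length {window_mean:.0f} chars exceeded "
                f"{self.max_mean_chars} for {self.patience} consecutive windows"
            )
```

In a training loop, `update` would be called once per generated completion, and the raised error would stop the run before the remaining GPU budget is spent.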
5. Curriculum
The task sampler follows a three-phase curriculum:
| Episodes | Composition |
|---|---|
| 0–49 | Standard tasks only. The policy establishes a baseline on the familiar distribution. |
| 50–149 | 50 % destructive-outcome variants. The policy is exposed to the tasks where the normally-safe action is unavailable. |
| 150–299 | 70 % destructive-outcome variants. The policy is pushed to solve the hard distribution. |
Starting with destructive-only scenarios from episode 0 produces a cold-start problem: the policy fails every rollout, the group-relative advantage is zero, and GRPO cannot learn. Phasing them in after the warmup baseline is established avoids the cold-start without sacrificing the final capability.
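The three phases can be expressed as a small sampler. This is a sketch under assumptions: the helper names and the placeholder task identifiers are hypothetical, and only the episode boundaries and mixing fractions come from the table above.

```python
import random
from typing import Sequence


def destructive_fraction(episode: int) -> float:
    """Share of destructive-outcome variants for a given episode index."""
    if episode < 50:
        return 0.0
    if episode < 150:
        return 0.5
    return 0.7


def sample_task(episode: int,
                standard_tasks: Sequence[str],
                destructive_tasks: Sequence[str],
                rng: random.Random) -> str:
    """Pick a task id according to the phase the episode falls in."""
    use_destructive = rng.random() < destructive_fraction(episode)
    pool = destructive_tasks if use_destructive else standard_tasks
    return rng.choice(pool)


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder ids; the real task names live in the environment.
    standard = ["task_a", "task_b"]
    destructive = ["task_a_destructive", "task_b_destructive"]
    for episode in (0, 75, 200):
        print(episode, sample_task(episode, standard, destructive, rng))
```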
6. Evaluation protocol
The held-out evaluation runs on seeds that are disjoint from both the training distribution and the warmup trace seeds. Three policies are compared on identical seeds:
- Scripted baseline. A regex-driven heuristic that picks a safe read-only action (`fs_ls`, `db_select`, `git_log`) if one is available in the prompt, else `draft_internal_memo`. No model inference. Establishes the floor.
- Supervised-warmup only. The SFT adapter loaded standalone. Measures what the warmup alone achieves.
- RL-trained. The final GRPO adapter. Measures the uplift from the RL stage.
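For reference, the scripted baseline amounts to something like the sketch below. How the real heuristic detects availability, and the order in which it prefers the three read-only actions, are assumptions here; the action names and the fallback are the ones listed above.

```python
import re

# Assumption: actions are preferred in this order when several are available.
SAFE_READ_ONLY = ("fs_ls", "db_select", "git_log")
FALLBACK = "draft_internal_memo"


def scripted_baseline(prompt: str) -> str:
    """Return the first safe read-only action named in the prompt, else the fallback.

    Assumption: an action is "available" when its identifier appears as a
    whole word in the prompt text.
    """
    for action in SAFE_READ_ONLY:
        if re.search(rf"\b{re.escape(action)}\b", prompt):
            return action
    return FALLBACK


if __name__ == "__main__":
    print(scripted_baseline("Available actions: db_drop, db_select"))
    print(scripted_baseline("Available actions: db_drop"))
```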
The eval has two tracks:
- Standard track: 24 scenarios across the four primary tasks, each sampled from the standard (non-destructive-only) distribution.
- Destructive-only track: 12 scenarios across the four destructive-outcome variants, with seeds pre-verified to resolve to R5.
All three policies see the same prompts and the same seeds. The reported numbers come from the standard track unless otherwise noted; the destructive-only track's role is to populate the R5 row of the confusion matrix so R5 recall is actually measured.
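The role of the destructive-only track is easiest to see in how per-level recall is computed. The sketch below uses toy counts that are not results from this project, and the nested-dict layout and the `recall` helper are assumptions for illustration. Without any true-R5 scenarios, R5 recall has no denominator.

```python
from typing import Dict, Optional

# confusion[true_level][predicted_level] = count
Confusion = Dict[str, Dict[str, int]]


def recall(confusion: Confusion, level: str) -> Optional[float]:
    """Recall for one true level, or None when that level never occurs."""
    row = confusion.get(level, {})
    total = sum(row.values())
    if total == 0:
        return None
    return row.get(level, 0) / total


if __name__ == "__main__":
    # Toy counts for illustration only; not measurements from the project.
    standard_only: Confusion = {"R1": {"R1": 5, "R2": 1}, "R2": {"R2": 4, "R1": 2}}
    with_destructive: Confusion = {**standard_only, "R5": {"R5": 9, "R4": 3}}
    print(recall(standard_only, "R5"))      # no true-R5 rows to score
    print(recall(with_destructive, "R5"))
```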
7. Reproducibility
Every deterministic choice that affects the final numbers is pinned:
- `pyproject.toml` pins Python dependencies.
- `training/config.yaml` pins hyperparameters with the values we ran.
- `training/generate_warmup_traces.py` regenerates the 78 traces deterministically from the env (given a fixed scenario generator; see §2 on cross-process caveats).
- `tests/` catches regressions in both the env and the training glue code before they reach the GPU.
- `tools/validate_submission.py` runs 94 compliance checks (OpenEnv API shape, file presence, endpoint availability, package metadata) and passes clean.
The Colab quickstart (notebooks/train_grpo_colab.ipynb) lets a
reviewer re-run the full pipeline on a T4 in ~80 minutes, or pull
the pre-trained adapter from the artifacts dataset in seconds.