| # PERMANENCE — Training Methodology |
|
|
| This document explains the methodological choices behind the |
| training pipeline and the reasoning for each. It is intended for |
| reviewers who want to understand the research decisions and for |
| practitioners who want to port the recipe to a different |
| environment. |
|
|
| --- |
|
|
| ## 1. Why not pure supervised fine-tuning |
|
|
| The obvious first try is to generate a dataset of |
| `(prompt, gold_completion)` pairs and do SFT. We rejected that |
| approach for three reasons: |
|
|
| 1. **Calibration cannot be supervised from demonstrations alone.** |
| The reward term |
| `level_accuracy × (1 − |confidence − level_accuracy|)` scores |
| the *confidence* the model emits. Demonstration traces force a |
| single confidence value per example, which is not the same as |
| teaching the model how its confidence should vary across |
| examples. RL optimises the term over the policy's own sampled |
| outputs, so confidence is scored against the accuracy the |
| policy actually achieves. |
|
|
| 2. **Destructive-outcome scenarios need exploration.** In the |
| variants where the normally-safe action is disabled, the |
| policy has to discover that the destructive action is now the |
| correct one. A supervised dataset that demonstrates the |
| destructive action would just teach "when prompt contains |
| 'URGENT' → do the destructive action", a surface shortcut the |
| policy would overfit to. RL allows the policy to reach the same |
| conclusion by trying both actions and comparing their rewards. |
|
|
| 3. **Option preservation is a trajectory-level signal.** Whether |
| an episode's early actions closed off downstream options can |
| only be scored at episode end. GRPO's group-relative advantage |
| over complete rollouts is the natural fit. |
|
|
| We do use SFT for warmup — see §2 — but only to teach the output |
| format and a bias toward producing well-formed R-level |
| predictions before RL optimises the policy. |
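
The calibration term quoted in reason 1 can be written out directly. The following is a minimal sketch rather than the project's reward code: the function name and the assumption that both inputs lie in [0, 1] are ours.

```python
def calibration_reward(level_accuracy: float, confidence: float) -> float:
    """level_accuracy * (1 - |confidence - level_accuracy|), as quoted above.

    Assumption: both inputs lie in [0, 1].
    """
    if not (0.0 <= level_accuracy <= 1.0 and 0.0 <= confidence <= 1.0):
        raise ValueError("both inputs are assumed to lie in [0, 1]")
    return level_accuracy * (1.0 - abs(confidence - level_accuracy))


# A fully correct prediction earns more the more confident it is.
for conf in (0.5, 0.8, 1.0):
    print(conf, calibration_reward(1.0, conf))

# A half-credit prediction peaks at confidence 0.5 instead.
for conf in (0.2, 0.5, 0.9):
    print(conf, calibration_reward(0.5, conf))
```

The second loop shows why a single demonstrated confidence value is not enough: the best confidence depends on how accurate the prediction actually is.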
|
|
| --- |
|
|
| ## 2. SFT warmup: traces generated by the live environment |
|
|
| The warmup dataset is 78 traces spanning R1–R5. The traces are |
| **generated by stepping the live environment at trace-creation |
| time**: |
|
|
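| ```python |
| # task_id, seed, action_id and params are supplied per trace. |
| env = PermanenceEnv(config={"force_task": task_id}) |
| obs, info = env.reset(seed=seed) |
| world = env._current_world_state  # world state sampled for this reset |
| action = ACTION_REGISTRY[action_id] |
| # Ask the live env for the R-level instead of hard-coding it per seed. |
| resolved_r = action.r_level_fn(world, params) # source of truth |
| completion = synthesise_completion(resolved_r, ...) |
| ``` |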
|
|
| This matters because the env's scenario generator is stochastic |
| with respect to pre-existing backups, snapshots, and clone |
| preservation. A fixed "seed X → backup present" assumption would |
| break silently across processes with different `PYTHONHASHSEED`. |
| Resolving the R-level from the live env every time the trace is |
| regenerated eliminates this class of bug. |
|
|
| Distribution of the 78 traces: R1 = 22, R2 = 23, R3 = 3, R4 = 7, |
| R5 = 23. The under-representation of R3 and R4 is acknowledged in |
| the README's "Honest limits" section; it reflects the scenario |
| generator's default distribution rather than a hidden preference. |
|
|
| --- |
|
|
| ## 3. Format-coverage gate |
|
|
| Between SFT and GRPO we run a gate on 20 held-out prompts: the |
| model generates a completion for each, and the gate passes only |
| if both the `<action/>` and `<reversibility/>` tags are present |
| in at least 80 % of completions (16 of 20). |
|
|
| The gate exists because we saw two early pipeline failures in |
| which SFT converged to low loss but emitted malformed tags at |
| generation time (the learned format collided with the base |
| model's instruction-tuning prior). Running the full GRPO stage on |
| a malformed policy would burn ~70 minutes of GPU time (the GRPO |
| budget in §4) for no useful signal. The gate catches this in |
| ~1 minute. |
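
The gate logic described above can be sketched in a few lines. This is an illustration under assumptions, not the pipeline's implementation: the function names are hypothetical, and the regexes assume a tag counts as present whenever an opening `<action` or `<reversibility` token appears, whatever attributes or closing style follow.

```python
import re

# Assumed tag shapes: only the opening token is checked, so self-closing
# and attribute-carrying forms both count as present.
ACTION_TAG = re.compile(r"<action\b")
REVERSIBILITY_TAG = re.compile(r"<reversibility\b")


def format_coverage(completions: list[str]) -> float:
    """Fraction of completions that contain both required tags."""
    if not completions:
        return 0.0
    ok = sum(
        1
        for text in completions
        if ACTION_TAG.search(text) and REVERSIBILITY_TAG.search(text)
    )
    return ok / len(completions)


def passes_gate(completions: list[str], threshold: float = 0.80) -> bool:
    """True when coverage meets the 80 % threshold."""
    return format_coverage(completions) >= threshold
```

With 20 held-out prompts, `passes_gate` returns `True` from 16 well-formed completions upward.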
|
|
| --- |
|
|
| ## 4. GRPO configuration |
|
|
| We use TRL's `GRPOTrainer` under Unsloth 4-bit quantisation with |
| LoRA rank 16. Settings worth explaining: |
|
|
| | Parameter | Value | Reason | |
| |---|---|---| |
| | `group_size` | 4 | Per-prompt rollout diversity; enough for the relative-advantage calculation to have non-zero variance on most prompts | |
| | `num_iterations` (μ) | 2 | Two PPO-style optimisation passes per generation batch. Trades a small amount of off-policy drift for faster convergence | |
| | `beta` (KL coefficient) | 0.04 | The TRL default at the time of our runs (later TRL releases changed the default, so we pin it explicitly). A non-zero β penalises drift away from the SFT reference, which guards against a late-training "forgetting" failure mode in which the policy loses previously correct predictions as the curriculum phases in harder tasks | |
| | `temperature` | 0.85 | High enough that rollouts within a group differ meaningfully, so the group-relative advantage has a useful gradient | |
| | `total_episodes` | 300 prompts | 300 × 4 = 1 200 rollouts on a T4 in ~70 min | |
| | `max_completion_length` | 280 tokens | Our completions are three short tags; longer budgets invite length-drift without improving signal | |
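
To make the `group_size` and `temperature` rows concrete, here is a textbook-style sketch of the group-relative advantage: each rollout's reward is standardised against the other rollouts for the same prompt. It is not TRL's code; the function name, the `eps` value, and the use of the population standard deviation are illustrative choices.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Standardise rewards within one prompt's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts with different rewards: the best one gets a positive advantage.
print(group_relative_advantages([0.2, 0.5, 0.5, 0.9]))

# Four identical rewards: every advantage is zero, so this prompt
# contributes no gradient.
print(group_relative_advantages([0.7, 0.7, 0.7, 0.7]))
```

The second call is the degenerate case that a larger group or a higher sampling temperature makes less likely.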
|
|
| ### 4.1 On reward shaping |
|
|
| We **deliberately do not** shape the environmental reward beyond |
| a dynamic weighting that phases the format reward out between |
| episodes 60 and 150. Every other signal the policy sees during |
| GRPO is the same four-component rubric it will be evaluated on. |
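
One way to read the phase-out is as a multiplier on the format-reward weight that falls from 1 to 0 between episodes 60 and 150. The sketch below assumes a linear ramp; the text does not state the shape of the ramp, and the function name is hypothetical.

```python
def format_weight_scale(episode: int, start: int = 60, end: int = 150) -> float:
    """Multiplier on the format-reward weight.

    Assumption: linear decay from 1.0 at `start` to 0.0 at `end`.
    """
    if episode <= start:
        return 1.0
    if episode >= end:
        return 0.0
    return (end - episode) / (end - start)


for ep in (0, 60, 105, 150, 299):
    print(ep, format_weight_scale(ep))
```

How the weight freed by the format term is redistributed across the remaining rubric components is not covered by this sketch.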
|
|
| We considered an "unlikeliness" shaping term (reward rare correct |
| solutions more) but removed it after observing that the technique |
| is designed for binary-verifier tasks like theorem proving. In a |
| **continuous-reward classification** task like ours, partial |
| credit means the highest-reward sample in a group is usually the |
| correct one, so a term that favours rare samples ends up |
| penalising correctness. The clearest diagnostic was one |
| comparison from a pilot run: |
|
|
| ``` |
| db_snapshot (actual R-level R2): |
| predicted R1 → avg shaped reward 0.773 |
| predicted R2 → avg shaped reward 0.751 |
| ``` |
|
|
| The shaping inverted the gradient: the incorrect R1 prediction |
| out-scored the correct R2. Disabling it restored the expected |
| ordering (`correct R2 > incorrect R1`), which we verified with a |
| quick sanity check over 4 sample rollouts before committing to |
| the change. The general principle we take from this: match the |
| training signal to the evaluation signal, and do not add |
| gradient pressure you will not measure. |
|
|
| ### 4.2 Length monitor |
|
|
| Independently of the reward architecture, the pipeline tracks the |
| rolling-window mean completion length. If it exceeds 1 000 |
| characters for three consecutive windows, the callback aborts |
| training with a clean error. This caught two early failure modes |
| where the policy drifted into verbose explanation blocks (+3 × |
| completion length, −50 % throughput) that are penalised by the |
| format rubric but not enough to outweigh the GRPO advantage from |
| the occasional correct solution in the long tail. The monitor |
| aborts those runs cleanly instead of letting them burn the full |
| GPU budget. |
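
The abort rule can be sketched as a small stateful monitor. This is a hedged illustration, not the project's callback: the class names and the window size of 16 are assumptions, and the windows here are non-overlapping, whereas the real callback may use a sliding window.

```python
from collections import deque


class LengthDriftError(RuntimeError):
    """Raised when mean completion length stays above the limit."""


class LengthMonitor:
    """Abort after `patience` consecutive windows whose mean length is too high."""

    def __init__(self, window_size: int = 16, max_mean_chars: float = 1000.0,
                 patience: int = 3) -> None:
        self.lengths = deque(maxlen=window_size)  # lengths in the current window
        self.max_mean_chars = max_mean_chars
        self.patience = patience
        self.bad_windows = 0  # consecutive windows over the limit

    def update(self, completion: str) -> None:
        self.lengths.append(len(completion))
        if len(self.lengths) < self.lengths.maxlen:
            return  # window not full yet
        window_mean = sum(self.lengths) / len(self.lengths)
        self.lengths.clear()  # start the next (non-overlapping) window
        if window_mean > self.max_mean_chars:
            self.bad_windows += 1
        else:
            self.bad_windows = 0  # one healthy window resets the count
        if self.bad_windows >= self.patience:
            raise LengthDriftError(
                f"mean completion length {window_mean:.0f} chars exceeded "
                f"{self.max_mean_chars:.0f} for {self.patience} consecutive windows"
            )
```

A training callback would call `update` once per completion and let `LengthDriftError` propagate to stop the run.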
|
|
| --- |
|
|
| ## 5. Curriculum |
|
|
| The task sampler follows a three-phase curriculum: |
|
|
| | Episodes | Composition | |
| |---|---| |
| | 0 – 49 | Standard tasks only. The policy establishes a baseline on the familiar distribution. | |
| | 50 – 149 | 50 % destructive-outcome variants. The policy is exposed to the tasks where the normally-safe action is unavailable. | |
| | 150 – 299 | 70 % destructive-outcome variants. The policy is pushed to solve the hard distribution. | |
|
|
| Starting with destructive-only scenarios from episode 0 produces |
| a cold-start problem: the policy fails every rollout, every |
| reward in a group is identical, the group-relative advantage is |
| therefore zero, and GRPO cannot learn. Phasing them in after the |
| standard-task baseline of episodes 0 – 49 is established avoids |
| the cold-start without sacrificing the final capability. |
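
The three phases in the table can be expressed as a lookup plus a sampling step. This is a sketch under assumptions: it treats each percentage as a per-episode probability with independent draws, which the table does not specify, and both function names are hypothetical.

```python
import random


def destructive_fraction(episode: int) -> float:
    """Share of destructive-outcome variants for a 0-indexed episode."""
    if not 0 <= episode < 300:
        raise ValueError("the curriculum covers episodes 0 to 299")
    if episode < 50:
        return 0.0
    if episode < 150:
        return 0.5
    return 0.7


def sample_is_destructive(episode: int, rng: random.Random) -> bool:
    """Assumed sampling rule: an independent draw per episode."""
    return rng.random() < destructive_fraction(episode)


rng = random.Random(0)
schedule = [sample_is_destructive(ep, rng) for ep in range(300)]
print(sum(schedule), "destructive-variant episodes out of", len(schedule))
```

Under this reading the expected number of destructive-variant episodes is 100 × 0.5 + 150 × 0.7 = 155 of 300; any single run will vary around that.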
|
|
| --- |
|
|
| ## 6. Evaluation protocol |
|
|
| The held-out evaluation runs on seeds that are disjoint from both |
| the training seeds and the warmup-trace seeds. Three policies are |
| compared on identical seeds: |
|
|
| 1. **Scripted baseline.** A regex-driven heuristic that picks a |
| safe read-only action (`fs_ls`, `db_select`, `git_log`) if one |
| is available in the prompt, else `draft_internal_memo`. No |
| model inference. Establishes the floor. |
| 2. **Supervised-warmup only.** The SFT adapter loaded standalone. |
| Measures what the warmup alone achieves. |
| 3. **RL-trained.** The final GRPO adapter. Measures the uplift |
| from the RL stage. |
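
For orientation, the scripted baseline in item 1 might look like the sketch below. The action names come from the text; the function name, the whole-word regex match, and the priority order among the three read-only actions are assumptions.

```python
import re

# Read-only actions named in the text; the priority order is an assumption.
SAFE_READ_ONLY = ("fs_ls", "db_select", "git_log")
FALLBACK_ACTION = "draft_internal_memo"


def scripted_baseline(prompt: str) -> str:
    """Pick the first safe read-only action mentioned in the prompt."""
    for action in SAFE_READ_ONLY:
        # Whole-word match so that e.g. "fs_ls_recursive" does not count.
        if re.search(rf"\b{re.escape(action)}\b", prompt):
            return action
    return FALLBACK_ACTION


print(scripted_baseline("Available actions: fs_rm, db_select, git_log"))
print(scripted_baseline("Available actions: fs_rm, db_drop"))
```

Because it never runs a model, this policy gives a reference point for how far read-only heuristics alone can get.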
|
|
| The eval has two tracks: |
|
|
| - **Standard track**: 24 scenarios across the four primary tasks, |
| each sampled from the standard (non-destructive-only) |
| distribution. |
| - **Destructive-only track**: 12 scenarios across the four |
| destructive-outcome variants, with seeds pre-verified to |
| resolve to R5. |
|
|
| All three policies see the same prompts and the same seeds. The |
| reported numbers come from the standard track unless otherwise |
| noted; the destructive-only track's role is to populate the R5 |
| row of the confusion matrix so R5 recall is actually measured. |
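
The reason the destructive-only track is needed shows up as soon as per-level recall is computed: a level with no examples has no defined recall. The following is a minimal sketch with toy labels, not project results; the function name is hypothetical.

```python
LEVELS = ("R1", "R2", "R3", "R4", "R5")


def per_level_recall(pairs):
    """pairs: iterable of (actual, predicted) R-levels.

    Returns recall per level, or None for a level with no examples.
    """
    support = dict.fromkeys(LEVELS, 0)
    hits = dict.fromkeys(LEVELS, 0)
    for actual, predicted in pairs:
        support[actual] += 1
        if predicted == actual:
            hits[actual] += 1
    return {
        lvl: (hits[lvl] / support[lvl] if support[lvl] else None)
        for lvl in LEVELS
    }


# Toy labels for illustration only, not results from the project.
standard_only = [("R1", "R1"), ("R2", "R1"), ("R2", "R2")]
print(per_level_recall(standard_only)["R5"])  # None: R5 recall is unmeasured
```

Adding scenarios that resolve to R5 gives the R5 row a non-zero support, so its recall becomes a measured number.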
|
|
| --- |
|
|
| ## 7. Reproducibility |
|
|
| Every choice that affects the final numbers is pinned or |
| checked: |
|
|
| - `pyproject.toml` pins Python dependencies. |
| - `training/config.yaml` pins hyperparameters with the values we |
| ran. |
| - `training/generate_warmup_traces.py` regenerates the 78 traces |
| deterministically from the env (given a fixed scenario |
| generator; see §2 on cross-process caveats). |
| - `tests/` catches regressions in both the env and the training |
| glue code before they reach the GPU. |
| - `tools/validate_submission.py` runs 94 compliance checks |
| (OpenEnv API shape, file presence, endpoint availability, |
| package metadata) and passes clean. |
|
|
| The Colab quickstart (`notebooks/train_grpo_colab.ipynb`) lets a |
| reviewer re-run the full pipeline on a T4 in ~80 minutes, or pull |
| the pre-trained adapter from the artifacts dataset in seconds. |
|
|