| # PERMANENCE — Training Methodology | |
| This document explains the methodological choices behind the | |
| training pipeline and why they are made. It is intended for | |
| reviewers who want to understand the research decisions, and for | |
| practitioners who want to port the recipe to a different env. | |
| --- | |
| ## 1. Why not pure supervised fine-tuning | |
| The obvious first try is to generate a dataset of | |
| `(prompt, gold_completion)` pairs and do SFT. We rejected that | |
| approach for three reasons: | |
| 1. **Calibration cannot be supervised from demonstrations alone.** | |
| The reward term | |
| `level_accuracy × (1 − |confidence − level_accuracy|)` scores | |
| the *confidence* the model emits. Demonstration traces force a | |
| single confidence value per example, which is not the same as | |
| teaching the model how its confidence should vary across | |
| examples. RL optimises this distributionally. | |
| 2. **Destructive-outcome scenarios need exploration.** In the | |
| variants where the normally-safe action is disabled, the | |
| policy has to discover that the destructive action is now the | |
| correct one. A supervised dataset that demonstrates the | |
| destructive action would just teach "when prompt contains | |
| 'URGENT' → do the destructive action", which the policy would | |
| over-fit. RL allows the policy to reach the same conclusion by | |
| trying both. | |
| 3. **Option preservation is a trajectory-level signal.** Whether | |
| an episode's early actions closed off downstream options can | |
| only be scored at episode end. GRPO's group-relative advantage | |
| over complete rollouts is the natural fit. | |
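The calibration term quoted in point 1 can be written out directly. The sketch below assumes both `level_accuracy` and `confidence` lie in [0, 1]; the function name `calibration_term` is illustrative and not taken from the codebase.

```python
def calibration_term(level_accuracy: float, confidence: float) -> float:
    """level_accuracy * (1 - |confidence - level_accuracy|), as quoted in point 1.

    Assumes both arguments are in [0, 1].
    """
    return level_accuracy * (1.0 - abs(confidence - level_accuracy))


# A fully correct prediction earns more as the stated confidence rises.
high = calibration_term(1.0, 0.9)
low = calibration_term(1.0, 0.5)

# A partially correct prediction scores best when confidence matches accuracy.
matched = calibration_term(0.5, 0.5)
overconfident = calibration_term(0.5, 1.0)
```

The best confidence depends on how accurate the prediction turns out to be, which is why a single demonstrated confidence value per example carries little information about the term.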
| We do use SFT for warmup — see §2 — but only to teach the output | |
| format and a bias toward producing well-formed R-level | |
| predictions before RL optimises the policy. | |
| --- | |
| ## 2. SFT warmup: traces generated by the live environment | |
| The warmup dataset is 78 traces spanning R1–R5. The traces are | |
| **generated by stepping the live environment at trace-creation | |
| time**: | |
| ```python | |
| # task_id, seed, action_id and params are defined by the enclosing trace loop. | |
| env = PermanenceEnv(config={"force_task": task_id})  # pin the env to one task | |
| obs, info = env.reset(seed=seed) | |
| world = env._current_world_state  # world state produced by this reset | |
| action = ACTION_REGISTRY[action_id] | |
| resolved_r = action.r_level_fn(world, params) # source of truth | |
| completion = synthesise_completion(resolved_r, ...) | |
| ``` | |
| This matters because the env's scenario generator is stochastic | |
| with respect to pre-existing backups, snapshots, and clone | |
| preservation. A fixed "seed X → backup present" assumption would | |
| break silently across processes with different `PYTHONHASHSEED`. | |
| Resolving the R-level from the live env every time the trace is | |
| regenerated eliminates this class of bug. | |
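One common way this kind of cross-process drift arises is Python's built-in `hash()` on strings, which is salted per process unless `PYTHONHASHSEED` is fixed. The sketch below is not the env's code; it only illustrates a process-independent alternative, and `stable_seed` is a hypothetical helper.

```python
import hashlib


def stable_seed(task_id: str, index: int) -> int:
    """Derive a 32-bit seed that is identical in every Python process.

    Unlike hash(task_id), a SHA-256 digest does not depend on PYTHONHASHSEED.
    """
    digest = hashlib.sha256(f"{task_id}:{index}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


seed_a = stable_seed("db_snapshot", 0)
seed_b = stable_seed("db_snapshot", 0)
```

Even with stable seeds, resolving the R-level from the live env remains the safer choice, because it makes no assumption about what a given seed produces.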
| Distribution of the 78 traces: R1 = 22, R2 = 23, R3 = 3, R4 = 7, | |
| R5 = 23. The underweight on R3 and R4 is acknowledged in the | |
| README's "Honest limits" section; it reflects the scenario | |
| generator's default distribution rather than a hidden preference. | |
| --- | |
| ## 3. Format-coverage gate | |
| Between SFT and GRPO we run a gate on 20 held-out prompts: the | |
| model generates a completion for each, and the gate passes only if | |
| both `<action/>` and `<reversibility/>` tags are present in at | |
| least 80 % of completions. | |
| The gate exists because we saw two early pipeline failures in | |
| which SFT converged to low loss but emitted malformed tags at | |
| generation time (collision with the instruction-tuning prior). | |
| Running the full GRPO stage on a malformed policy would burn ~60 | |
| minutes of GPU time for no useful signal. The gate catches this | |
| in ~1 minute. | |
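A minimal sketch of such a gate, assuming that "present" means the opening of each tag appears somewhere in the completion. The regexes, function names, and the attribute syntax in the sample completion are illustrative assumptions, not the pipeline's actual code.

```python
import re

# Assumed tag shapes: any element that starts with <action or <reversibility.
ACTION_TAG = re.compile(r"<action\b")
REVERSIBILITY_TAG = re.compile(r"<reversibility\b")


def format_coverage(completions: list[str]) -> float:
    """Fraction of completions that contain both required tags."""
    if not completions:
        return 0.0
    well_formed = sum(
        1
        for text in completions
        if ACTION_TAG.search(text) and REVERSIBILITY_TAG.search(text)
    )
    return well_formed / len(completions)


def passes_gate(completions: list[str], threshold: float = 0.80) -> bool:
    """True when coverage reaches the 80 % threshold described above."""
    return format_coverage(completions) >= threshold


# Four well-formed completions and one free-text completion (hypothetical).
samples = ['<action id="fs_ls"/><reversibility level="R1"/>'] * 4
samples.append("I think we should list the files first.")
```

In the pipeline the list would hold the 20 generated completions; a `False` result would stop the run before the GRPO stage starts.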
| --- | |
| ## 4. GRPO configuration | |
| We use TRL's `GRPOTrainer` under Unsloth 4-bit quantisation with | |
| LoRA rank 16. Settings worth explaining: | |
| | Parameter | Value | Reason | | |
| |---|---|---| | |
| | `group_size` | 4 | Per-prompt rollout diversity; enough for the relative-advantage calculation to have non-zero variance on most prompts | | |
| | `num_iterations` (μ) | 2 | Two optimisation passes over each generation batch. Trades a small amount of off-policy drift for faster convergence | |
| | `beta` (KL coefficient) | 0.04 | The TRL default. A higher β holds the policy closer to the SFT reference, which guards against a late-training "forgetting" failure mode in which the policy loses previously correct predictions as the curriculum phases in harder tasks | |
| | `temperature` | 0.85 | High enough that rollouts within a group differ meaningfully, so the group-relative advantage has a useful gradient | | |
| | `total_episodes` | 300 prompts | 300 × 4 = 1 200 rollouts on a T4 in ~70 min | | |
| | `max_completion_length` | 280 | Our completions are three short tags; longer budgets invite length-drift without improving signal | | |
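The table can be restated as keyword arguments. This is a sketch rather than the contents of `training/config.yaml`: the values come from the table, while the key names follow TRL's `GRPOConfig`, where the group size is called `num_generations`. Field names should be checked against the pinned TRL release.

```python
# Values from the table above; key names follow TRL's GRPOConfig.
grpo_kwargs = {
    "num_generations": 4,          # the table's group_size
    "num_iterations": 2,           # mu: optimisation passes per generation batch
    "beta": 0.04,                  # KL coefficient
    "temperature": 0.85,
    "max_completion_length": 280,
}

# total_episodes is kept as a plain variable here (300 prompts).
total_prompts = 300
total_rollouts = total_prompts * grpo_kwargs["num_generations"]
```

If the installed TRL version exposes all of these fields, the dictionary can be passed as `GRPOConfig(**grpo_kwargs)` alongside the usual output and optimiser settings.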
| ### 4.1 On reward shaping | |
| We **deliberately do not** shape the environmental reward beyond | |
| a dynamic weighting that phases the format reward out between | |
| episodes 60 and 150. Every other signal the policy sees during | |
| GRPO is the same four-component rubric it will be evaluated on. | |
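The phase-out can be pictured as a weight on the format reward that falls from 1 to 0 between the two episode marks. The linear shape, the endpoints of the weight, and the name `format_weight` are assumptions made for illustration; the text above only fixes the episode range.

```python
def format_weight(episode: int, start: int = 60, end: int = 150) -> float:
    """Weight applied to the format reward at a given episode.

    Assumed schedule: full weight up to `start`, zero from `end` onward,
    and a linear ramp in between.
    """
    if episode <= start:
        return 1.0
    if episode >= end:
        return 0.0
    return (end - episode) / (end - start)
```

Under this assumption the format component would be multiplied by `format_weight(episode)` before being combined with the other rubric components.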
| We considered an "unlikeliness" shaping term (reward rare correct | |
| solutions more) but removed it after observing that the technique | |
| is designed for binary-verifier tasks like theorem proving. In a | |
| **continuous-reward classification** task like ours, where | |
| partial credit means the top-ranked reward sample is usually the | |
| correct one, the shaping penalises correctness. The clearest | |
| diagnostic was a single metric from a pilot run: | |
| ``` | |
| db_snapshot (actual R-level R2): | |
| predicted R1 → avg shaped reward 0.773 | |
| predicted R2 → avg shaped reward 0.751 | |
| ``` | |
| The shaping inverted the ordering: the incorrect R1 prediction | |
| out-scored the correct R2 prediction (0.773 vs 0.751). Disabling | |
| it restored the expected ordering (`correct R2 > incorrect R1`), | |
| which we verified with a quick sanity check over 4 sample rollouts | |
| before committing to the change. The general principle is the | |
| methodological guidance we ship here: match the training signal to | |
| the evaluation signal, and don't add gradient pressure you will | |
| not measure. | |
| ### 4.2 Length monitor | |
| Independently of the reward architecture, the pipeline tracks the | |
| rolling-window mean completion length. If it exceeds 1 000 | |
| characters for three consecutive windows, the callback aborts | |
| training with a clean error. This caught two early failure modes | |
| where the policy drifted into verbose explanation blocks (3× | |
| completion length, −50 % throughput). The format rubric penalises | |
| such blocks, but not enough to outweigh the GRPO advantage from | |
| the occasional correct solution in the long tail. The monitor | |
| aborts those runs cleanly instead of letting them burn the full | |
| GPU budget. | |
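The monitor logic can be sketched independently of any trainer. The version below assumes non-overlapping windows of a fixed number of completions; the window size, the class name `LengthMonitor`, and the error message are illustrative, and wiring it into a trainer callback is left out.

```python
from collections import deque


class LengthMonitor:
    """Abort when the mean completion length stays too high for several windows."""

    def __init__(self, window: int = 16, limit: float = 1000.0, patience: int = 3):
        self.window = window        # completions per window (assumed value)
        self.limit = limit          # mean length threshold, in characters
        self.patience = patience    # consecutive over-limit windows tolerated
        self._lengths = deque(maxlen=window)
        self._seen = 0
        self._over = 0

    def update(self, completion: str) -> None:
        self._lengths.append(len(completion))
        self._seen += 1
        if self._seen % self.window:
            return  # current window is not complete yet
        mean_length = sum(self._lengths) / len(self._lengths)
        # Count consecutive over-limit windows; any healthy window resets the count.
        self._over = self._over + 1 if mean_length > self.limit else 0
        if self._over >= self.patience:
            raise RuntimeError(
                f"mean completion length {mean_length:.0f} exceeded "
                f"{self.limit:.0f} characters for {self.patience} consecutive windows"
            )
```

Calling `update` once per generated completion is enough; the exception surfaces as a clean abort instead of a silently degraded run.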
| --- | |
| ## 5. Curriculum | |
| The task sampler follows a three-phase curriculum: | |
| | Episodes | Composition | | |
| |---|---| | |
| | 0 – 49 | Standard tasks only. The policy establishes a baseline on the familiar distribution. | | |
| | 50 – 149 | 50 % destructive-outcome variants. The policy is exposed to the tasks where the normally-safe action is unavailable. | | |
| | 150 – 299 | 70 % destructive-outcome variants. The policy is pushed to solve the hard distribution. | | |
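The three phases in the table can be expressed as a small sampler. This is a sketch under the assumption that each episode independently draws a destructive-outcome variant with the phase's probability; the function names are illustrative.

```python
import random


def destructive_fraction(episode: int) -> float:
    """Share of destructive-outcome variants for a given episode index."""
    if episode < 50:
        return 0.0
    if episode < 150:
        return 0.5
    return 0.7


def sample_variant(episode: int, rng: random.Random) -> str:
    """Draw 'destructive' or 'standard' according to the current phase."""
    if rng.random() < destructive_fraction(episode):
        return "destructive"
    return "standard"
```

Passing an explicit `random.Random` instance keeps the draw reproducible for a fixed seed.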
| Starting with destructive-only scenarios from episode 0 produces | |
| a cold-start problem: the policy fails every rollout, the | |
| group-relative advantage is zero, and GRPO cannot learn. Phasing | |
| them in after the warmup baseline is established avoids the | |
| cold-start without sacrificing the final capability. | |
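The cold-start effect follows from how a group-relative advantage is computed. The sketch below uses one common formulation (reward minus group mean, divided by group standard deviation plus a small epsilon); the exact normalisation in a given trainer may differ, and `group_advantages` is an illustrative name.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalise each rollout's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every rollout in the group fails: all advantages are zero, so no gradient.
all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])

# One rollout succeeds: it gets a positive advantage, the rest a negative one.
mixed = group_advantages([0.0, 0.0, 0.0, 1.0])
```

A group needs at least one rollout that differs from the others before the update carries any signal, which is what the phased curriculum is meant to provide.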
| --- | |
| ## 6. Evaluation protocol | |
| The held-out evaluation runs on seeds that are disjoint from both | |
| the training seeds and the warmup-trace seeds. Three | |
| policies are compared on identical seeds: | |
| 1. **Scripted baseline.** A regex-driven heuristic that picks a | |
| safe read-only action (`fs_ls`, `db_select`, `git_log`) if one | |
| is available in the prompt, else `draft_internal_memo`. No | |
| model inference. Establishes the floor. | |
| 2. **Supervised-warmup only.** The SFT adapter loaded standalone. | |
| Measures what the warmup alone achieves. | |
| 3. **RL-trained.** The final GRPO adapter. Measures the uplift | |
| from the RL stage. | |
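The scripted baseline in item 1 can be sketched in a few lines. The action names come from the description above; the prompt layout in the usage lines and the choice to return the first safe action that appears in the prompt are assumptions.

```python
import re

SAFE_READ_ONLY = ("fs_ls", "db_select", "git_log")
FALLBACK_ACTION = "draft_internal_memo"

# Match any safe action name as a whole word.
_SAFE_PATTERN = re.compile(r"\b(" + "|".join(SAFE_READ_ONLY) + r")\b")


def scripted_baseline(prompt: str) -> str:
    """Return the first safe read-only action named in the prompt, else the fallback."""
    match = _SAFE_PATTERN.search(prompt)
    return match.group(1) if match else FALLBACK_ACTION


# Hypothetical prompt layouts, for illustration only.
with_safe = scripted_baseline("Available actions: git_log, draft_internal_memo")
without_safe = scripted_baseline("Available actions: draft_internal_memo")
```

Because it involves no model inference, the baseline is deterministic for a given prompt, which makes it a stable floor for the comparison.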
| The eval has two tracks: | |
| - **Standard track**: 24 scenarios across the four primary tasks, | |
| each sampled from the standard (non-destructive-only) | |
| distribution. | |
| - **Destructive-only track**: 12 scenarios across the four | |
| destructive-outcome variants, with seeds pre-verified to | |
| resolve to R5. | |
| All three policies see the same prompts and the same seeds. The | |
| reported numbers come from the standard track unless otherwise | |
| noted; the destructive-only track's role is to populate the R5 | |
| row of the confusion matrix so R5 recall is actually measured. | |
| --- | |
| ## 7. Reproducibility | |
| Every choice that affects the final numbers is pinned: | |
| - `pyproject.toml` pins Python dependencies. | |
| - `training/config.yaml` pins hyperparameters with the values we | |
| ran. | |
| - `training/generate_warmup_traces.py` regenerates the 78 traces | |
| deterministically from the env (given a fixed scenario | |
| generator; see §2 on cross-process caveats). | |
| - `tests/` catches regressions in both the env and the training | |
| glue code before they reach the GPU. | |
| - `tools/validate_submission.py` runs 94 compliance checks | |
| (OpenEnv API shape, file presence, endpoint availability, | |
| package metadata) and passes clean. | |
| The Colab quickstart (`notebooks/train_grpo_colab.ipynb`) lets a | |
| reviewer re-run the full pipeline on a T4 in ~80 minutes, or pull | |
| the pre-trained adapter from the artifacts dataset in seconds. | |