# PERMANENCE — Training Methodology

This document explains the methodological choices behind the
training pipeline and why they are made. It is intended for
reviewers who want to understand the research decisions, and for
practitioners who want to port the recipe to a different environment.

---

## 1. Why not pure supervised fine-tuning

The obvious first try is to generate a dataset of
`(prompt, gold_completion)` pairs and do SFT. We rejected that
approach for three reasons:

1. **Calibration cannot be supervised from demonstrations alone.**
   The reward term
   `level_accuracy × (1 − |confidence − level_accuracy|)` scores
   the *confidence* the model emits. Demonstration traces force a
   single confidence value per example, which is not the same as
   teaching the model how its confidence should vary across
   examples. RL optimises this distributionally.

2. **Destructive-outcome scenarios need exploration.** In the
   variants where the normally-safe action is disabled, the
   policy has to discover that the destructive action is now the
   correct one. A supervised dataset that demonstrates the
   destructive action risks teaching a surface shortcut such as
   "when prompt contains 'URGENT' → do the destructive action",
   which the policy would overfit to. RL lets the policy reach
   the same conclusion by trying both actions and comparing the
   rewards.

3. **Option preservation is a trajectory-level signal.** Whether
   an episode's early actions closed off downstream options can
   only be scored at episode end. GRPO's group-relative advantage
   over complete rollouts is the natural fit.

We do use SFT for warmup — see §2 — but only to teach the output
format and a bias toward producing well-formed R-level
predictions before RL optimises the policy.
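
The calibration term from point 1 can be written out directly. The sketch below is a minimal illustration, assuming `level_accuracy` and `confidence` both lie in [0, 1]; the function name `calibration_reward` is hypothetical and not taken from the training code.

```python
def calibration_reward(level_accuracy: float, confidence: float) -> float:
    """Compute level_accuracy * (1 - |confidence - level_accuracy|).

    Assumes both inputs are in [0, 1]; the name is illustrative only.
    """
    return level_accuracy * (1.0 - abs(confidence - level_accuracy))


# For a fixed accuracy of 0.5, compare under-confident, matched and
# over-confident predictions. The matched confidence scores highest.
partially_correct = [calibration_reward(0.5, c) for c in (0.1, 0.5, 0.9)]
```

The term peaks when the emitted confidence equals the achieved accuracy, and a fully wrong prediction (`level_accuracy = 0`) scores zero whatever confidence is emitted. A single demonstration can only show one point on this surface, which is the limitation point 1 describes.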

---

## 2. SFT warmup: traces generated by the live environment

The warmup dataset is 78 traces spanning R1–R5. The traces are
**generated by stepping the live environment at trace-creation
time**:

```python
# task_id, seed, action_id and params are supplied per trace by the caller.
env = PermanenceEnv(config={"force_task": task_id})
obs, info = env.reset(seed=seed)
world = env._current_world_state                 # live world state after reset
action = ACTION_REGISTRY[action_id]
resolved_r = action.r_level_fn(world, params)    # source of truth
completion = synthesise_completion(resolved_r, ...)
```

This matters because the env's scenario generator is stochastic
with respect to pre-existing backups, snapshots, and clone
preservation. A fixed "seed X → backup present" assumption would
break silently across processes with different `PYTHONHASHSEED`.
Resolving the R-level from the live env every time the trace is
regenerated eliminates this class of bug.
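
The cross-process effect is easy to reproduce in isolation. The sketch below assumes the instability comes from Python's randomised string hashing, which is what `PYTHONHASHSEED` controls; whether the scenario generator hashes strings in exactly this way is an assumption, and `string_hash_under` is a hypothetical helper.

```python
import os
import subprocess
import sys


def string_hash_under(hash_seed: str, text: str = "backup_present") -> int:
    """Return hash(text) as computed by a fresh interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


first = string_hash_under("1")
second = string_hash_under("2")   # different hash seed, same string
repeat = string_hash_under("1")   # same hash seed as `first`
```

Any logic that derives "backup present" from such a hash is stable only while the hash seed is fixed, which is why the traces read the R-level from the live environment instead of assuming it.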

Distribution of the 78 traces: R1 = 22, R2 = 23, R3 = 3, R4 = 7,
R5 = 23. The underweight on R3 and R4 is acknowledged in the
README's "Honest limits" section; it reflects the scenario
generator's default distribution rather than a hidden preference.

---

## 3. Format-coverage gate

Between SFT and GRPO we run a gate on 20 held-out prompts. The
model generates a completion for each prompt, and the gate passes
only if both the `<action/>` and `<reversibility/>` tags are
present in at least 80 % of completions (16 of 20).

The gate exists because we saw two early pipeline failures in
which SFT converged to low loss but emitted malformed tags at
generation time (collision with the instruction-tuning prior).
Running the full GRPO stage on a malformed policy would burn ~60
minutes of GPU time for no useful signal. The gate catches this
in ~1 minute.
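
A gate of this kind needs very little code. The following is a minimal sketch, not the pipeline's implementation: the regular expressions are an assumption about what counts as a present tag, and the names `has_required_tags`, `format_coverage` and `passes_gate` are hypothetical.

```python
import re

# Assumed tag patterns: accept <action/>, <action ...> and similar forms.
ACTION_TAG = re.compile(r"<action\b[^>]*>")
REVERSIBILITY_TAG = re.compile(r"<reversibility\b[^>]*>")


def has_required_tags(completion: str) -> bool:
    """True if the completion contains both required tags."""
    return bool(ACTION_TAG.search(completion)) and bool(
        REVERSIBILITY_TAG.search(completion)
    )


def format_coverage(completions: list[str]) -> float:
    """Fraction of completions that contain both tags."""
    if not completions:
        return 0.0
    return sum(has_required_tags(c) for c in completions) / len(completions)


def passes_gate(completions: list[str], threshold: float = 0.80) -> bool:
    """Gate decision: coverage must reach the threshold."""
    return format_coverage(completions) >= threshold
```

With 20 prompts, 16 well-formed completions pass the gate and 15 do not.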

---

## 4. GRPO configuration

We use TRL's `GRPOTrainer` under Unsloth 4-bit quantisation with
LoRA rank 16. Settings worth explaining:

| Parameter | Value | Reason |
|---|---|---|
| `group_size` | 4 | Per-prompt rollout diversity; enough for the relative-advantage calculation to have non-zero variance on most prompts |
| `num_iterations` (μ) | 2 | Two inner PPO updates per generation batch. Trades a small amount of off-policy drift for faster convergence |
| `beta` (KL coefficient) | 0.04 | The TRL default in the version we ran. The KL penalty keeps the policy from drifting far from the SFT reference, which prevents a late-training "forgetting" failure mode where the policy loses previously-correct predictions as the curriculum phases in harder tasks |
| `temperature` | 0.85 | High enough that rollouts within a group differ meaningfully, so the group-relative advantage has a useful gradient |
| `total_episodes` | 300 prompts | 300 × 4 = 1 200 rollouts on a T4 in ~70 min |
| `max_completion_length` | 280 | Our completions are three short tags; longer budgets invite length-drift without improving signal |

### 4.1 On reward shaping

We **deliberately do not** shape the environmental reward beyond
a dynamic weighting that phases the format reward out between
episodes 60 and 150. Every other signal the policy sees during
GRPO is the same four-component rubric it will be evaluated on.
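
One way to phase the format reward out is a linear ramp. The sketch below is an assumption about the shape of the schedule: the source only fixes the endpoints (episodes 60 and 150), so the linear interpolation, the initial weight of 1.0 and the name `format_weight` are all illustrative.

```python
def format_weight(
    episode: int, start: int = 60, end: int = 150, initial: float = 1.0
) -> float:
    """Weight on the format reward: full before `start`, zero from `end` on.

    The linear ramp between the two endpoints is an assumed shape.
    """
    if episode <= start:
        return initial
    if episode >= end:
        return 0.0
    return initial * (end - episode) / (end - start)
```

Under this assumed schedule the format reward carries full weight while the policy is still learning the output format, half weight at episode 105, and none for the second half of training.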

We considered an "unlikeliness" shaping term (reward rare correct
solutions more) but removed it after observing that the technique
is designed for binary-verifier tasks like theorem proving. In a
**continuous-reward classification** task like ours, where
partial credit means the top-ranked reward sample is usually the
correct one, the shaping penalises correctness. The clearest
diagnostic was a single metric from a pilot run:

```
db_snapshot (actual R-level R2):
  predicted R1 → avg shaped reward 0.773
  predicted R2 → avg shaped reward 0.751
```

With shaping enabled, the incorrect R1 prediction out-scored the
correct R2 prediction, so the shaping inverted the gradient.
Disabling it restored the expected ordering
(`correct R2 > incorrect R1`), which we verified with a quick
sanity check over 4 sample rollouts before committing to the
change. The general principle is the methodological guidance we
ship here: match the training signal to the evaluation signal, and
do not add gradient pressure you will not measure.

### 4.2 Length monitor

Independently of the reward architecture, the pipeline tracks the
rolling-window mean completion length. If the mean exceeds 1 000
characters for three consecutive windows, a callback aborts
training with a clean error instead of letting the run burn the
full GPU budget. This caught two early failure modes in which the
policy drifted into verbose explanation blocks (roughly 3× the
completion length and 50 % lower throughput). The format rubric
penalises such blocks, but not enough to outweigh the GRPO
advantage from the occasional correct solution in the long tail.
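
The monitor logic can be sketched independently of any trainer. The version below is an assumption-laden illustration: it evaluates non-overlapping windows of a fixed size (the source says "rolling-window" without giving the window size or overlap), and the class and exception names are hypothetical. In practice `update` would be called from a training callback.

```python
class LengthDriftError(RuntimeError):
    """Raised when completions stay too long for too many windows in a row."""


class LengthMonitor:
    def __init__(
        self, window_size: int = 16, max_mean_chars: float = 1000.0, patience: int = 3
    ) -> None:
        self.window_size = window_size        # assumed; not given in the source
        self.max_mean_chars = max_mean_chars  # 1 000 characters
        self.patience = patience              # three consecutive windows
        self.consecutive_breaches = 0
        self._window: list[int] = []

    def update(self, completion: str) -> None:
        """Record one completion; raise once `patience` windows in a row breach."""
        self._window.append(len(completion))
        if len(self._window) < self.window_size:
            return  # window not full yet
        mean_chars = sum(self._window) / len(self._window)
        self._window.clear()
        if mean_chars > self.max_mean_chars:
            self.consecutive_breaches += 1
        else:
            self.consecutive_breaches = 0  # any healthy window resets the count
        if self.consecutive_breaches >= self.patience:
            raise LengthDriftError(
                f"mean completion length {mean_chars:.0f} chars exceeded "
                f"{self.max_mean_chars:.0f} for {self.patience} consecutive windows"
            )
```

Requiring consecutive breaches means a single unusually long batch does not abort the run; only sustained drift does.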

---

## 5. Curriculum

The task sampler follows a three-phase curriculum:

| Episodes | Composition |
|---|---|
| 0 – 49 | Standard tasks only. The policy establishes a baseline on the familiar distribution. |
| 50 – 149 | 50 % destructive-outcome variants. The policy is exposed to the tasks where the normally-safe action is unavailable. |
| 150 – 299 | 70 % destructive-outcome variants. The policy is pushed to solve the hard distribution. |
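
The schedule in the table maps onto a small sampler. This is a sketch under assumptions: the phase boundaries and fractions come from the table, but drawing each episode independently with that probability, and the names `destructive_fraction` and `sample_task_kind`, are illustrative choices.

```python
import random


def destructive_fraction(episode: int) -> float:
    """Target share of destructive-outcome variants for a given episode."""
    if episode < 50:
        return 0.0
    if episode < 150:
        return 0.5
    return 0.7


def sample_task_kind(episode: int, rng: random.Random) -> str:
    """Draw 'destructive' with the phase's probability, else 'standard'."""
    if rng.random() < destructive_fraction(episode):
        return "destructive"
    return "standard"


rng = random.Random(0)
phase_one = [sample_task_kind(ep, rng) for ep in range(50)]
```

Because the first phase uses a probability of zero, every episode in `phase_one` is a standard task regardless of the random seed.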

Starting with destructive-only scenarios from episode 0 produces
a cold-start problem: the policy fails every rollout, so all
rollouts in a group receive the same reward, the group-relative
advantage is zero, and GRPO has no gradient to learn from. Phasing
the destructive variants in after a baseline is established on
standard tasks avoids the cold start without sacrificing the final
capability.
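
The cold-start argument can be checked numerically. The sketch below uses the commonly described GRPO normalisation (reward minus group mean, divided by group standard deviation). The helper name, the `eps` value and the use of the population standard deviation are assumptions, and TRL's implementation may differ in detail.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalise each rollout's reward against its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Group of 4 rollouts where every rollout fails: no learning signal.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Group where one rollout succeeds: that rollout gets a positive advantage.
one_success = group_relative_advantages([0.0, 0.0, 0.0, 1.0])
```

When every rollout in a group earns the same reward, each advantage is zero and the policy gradient for that prompt vanishes. A single success in the group is enough to produce a signal, which is what the early standard-task phase is meant to secure.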

---

## 6. Evaluation protocol

The held-out evaluation runs on seeds that are disjoint from both
the seeds seen during training and the warmup trace seeds. Three
policies are compared on identical seeds:

1. **Scripted baseline.** A regex-driven heuristic that picks a
   safe read-only action (`fs_ls`, `db_select`, `git_log`) if one
   is available in the prompt, else `draft_internal_memo`. No
   model inference. Establishes the floor.
2. **Supervised-warmup only.** The SFT adapter loaded standalone.
   Measures what the warmup alone achieves.
3. **RL-trained.** The final GRPO adapter. Measures the uplift
   from the RL stage.
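
The scripted baseline is simple enough to sketch in a few lines. The version below is an illustration under assumptions: it picks the first safe action name that appears anywhere in the prompt text, and the prompt format in the usage note and the function name `scripted_baseline` are hypothetical.

```python
import re

SAFE_READ_ONLY = ("fs_ls", "db_select", "git_log")
FALLBACK_ACTION = "draft_internal_memo"

# Match any safe action name as a whole word in the prompt.
_SAFE_PATTERN = re.compile(r"\b(" + "|".join(SAFE_READ_ONLY) + r")\b")


def scripted_baseline(prompt: str) -> str:
    """Return the first safe read-only action named in the prompt, else the fallback."""
    match = _SAFE_PATTERN.search(prompt)
    return match.group(1) if match else FALLBACK_ACTION
```

For a prompt such as `"Available actions: db_drop, db_select"` this returns `db_select`; when no safe action is listed it falls back to `draft_internal_memo`. No model inference is involved, so the result depends only on the prompt text.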

The eval has two tracks:

- **Standard track**: 24 scenarios across the four primary tasks,
  each sampled from the standard (non-destructive-only)
  distribution.
- **Destructive-only track**: 12 scenarios across the four
  destructive-outcome variants, with seeds pre-verified to
  resolve to R5.

All three policies see the same prompts and the same seeds. The
reported numbers come from the standard track unless otherwise
noted; the destructive-only track's role is to populate the R5
row of the confusion matrix so R5 recall is actually measured.

---

## 7. Reproducibility

Every deterministic choice that affects the final numbers is
pinned:

- `pyproject.toml` pins Python dependencies.
- `training/config.yaml` pins hyperparameters with the values we
  ran.
- `training/generate_warmup_traces.py` regenerates the 78 traces
  deterministically from the env (given a fixed scenario
  generator; see §2 on cross-process caveats).
- `tests/` catches regressions in both the env and the training
  glue code before they reach the GPU.
- `tools/validate_submission.py` runs 94 compliance checks
  (OpenEnv API shape, file presence, endpoint availability,
  package metadata) and passes clean.

The Colab quickstart (`notebooks/train_grpo_colab.ipynb`) lets a
reviewer re-run the full pipeline on a T4 in ~80 minutes, or pull
the pre-trained adapter from the artifacts dataset in seconds.