	# PERMANENCE — Training Methodology

	This document explains the methodological choices behind the
	training pipeline and the reasoning for each. It is intended for
	reviewers who want to understand the research decisions, and for
	practitioners who want to port the recipe to a different
	environment.

	---

	## 1. Why not pure supervised fine-tuning

	The obvious first try is to generate a dataset of
	`(prompt, gold_completion)` pairs and do SFT. We rejected that
	approach for three reasons:

	1. Calibration cannot be supervised from demonstrations alone.
	The reward term
	`level_accuracy × (1 − |confidence − level_accuracy|)` scores
	the confidence the model emits. Demonstration traces force a
	single confidence value per example, which is not the same as
	teaching the model how its confidence should vary across
	examples. RL optimises this distributionally.

	2. Destructive-outcome scenarios need exploration. In the
	variants where the normally-safe action is disabled, the
	policy has to discover that the destructive action is now the
	correct one. A supervised dataset that demonstrates the
	destructive action would just teach "when prompt contains
	'URGENT' → do the destructive action", which the policy would
	over-fit. RL allows the policy to reach the same conclusion by
	trying both.

	3. Option preservation is a trajectory-level signal. Whether
	an episode's early actions closed off downstream options can
	only be scored at episode end. GRPO's group-relative advantage
	over complete rollouts is the natural fit.
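
The calibration term quoted in point 1 can be written out directly. This is a minimal sketch, assuming both `level_accuracy` and `confidence` lie in [0, 1]; the function name and the range check are illustrative and not taken from the repository.

```python
def calibration_term(level_accuracy: float, confidence: float) -> float:
    """Compute level_accuracy * (1 - |confidence - level_accuracy|).

    Both inputs are assumed to lie in [0, 1]; this range is an assumption
    of the sketch, not something the source states.
    """
    for name, value in (("level_accuracy", level_accuracy), ("confidence", confidence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return level_accuracy * (1.0 - abs(confidence - level_accuracy))


# Correct prediction stated with full confidence.
print(calibration_term(1.0, 1.0))
# Correct prediction stated with hedged confidence: the gap is penalised.
print(calibration_term(1.0, 0.5))
# Wrong prediction: the leading factor zeroes the term whatever the confidence.
print(calibration_term(0.0, 0.9))
```

Because the term depends on the gap between stated confidence and realised accuracy, a single demonstrated confidence value per example cannot pin it down, which is the argument point 1 makes.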

	We do use SFT for warmup — see §2 — but only to teach the output
	format and a bias toward producing well-formed R-level
	predictions before RL optimises the policy.

	---

	## 2. SFT warmup: traces generated by the live environment

	The warmup dataset is 78 traces spanning R1–R5. The traces are
	**generated by stepping the live environment at trace-creation
	time**:

	```python
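	# task_id, seed, action_id and params come from the trace specification.
	env = PermanenceEnv(config={"force_task": task_id})
	obs, info = env.reset(seed=seed)
	# Read the world state the scenario generator actually produced.
	world = env._current_world_state
	action = ACTION_REGISTRY[action_id]
	# Resolve the R-level against the live world state: the source of truth.
	resolved_r = action.r_level_fn(world, params)
	completion = synthesise_completion(resolved_r, ...)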
	```

	This matters because the env's scenario generator is stochastic
	with respect to pre-existing backups, snapshots, and clone
	preservation. A fixed "seed X → backup present" assumption would
	break silently across processes with different `PYTHONHASHSEED`.
	Resolving the R-level from the live env every time the trace is
	regenerated eliminates this class of bug.
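
The cross-process effect mentioned above can be illustrated in isolation. The sketch below only shows the general Python mechanism (string hashes depend on `PYTHONHASHSEED`); whether the scenario generator depends on string hashing, for example through set iteration order, is an assumption here, and the helper name is hypothetical.

```python
import os
import subprocess
import sys


def string_hash_under_seed(text: str, hash_seed: int) -> int:
    """Return hash(text) as computed by a fresh interpreter with the given PYTHONHASHSEED."""
    env = {**os.environ, "PYTHONHASHSEED": str(hash_seed)}
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# The same string hashes identically when the hash seed is fixed...
same = string_hash_under_seed("backup", 1) == string_hash_under_seed("backup", 1)
# ...but differently under another hash seed, even though the env seed is unchanged.
differs = string_hash_under_seed("backup", 1) != string_hash_under_seed("backup", 2)
print(same, differs)
```

Anything derived from such hashes can change between processes, which is why a hard-coded "seed X → backup present" table is fragile while a value read from the live environment is not.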

	Distribution of the 78 traces: R1 = 22, R2 = 23, R3 = 3, R4 = 7,
	R5 = 23. The underweight on R3 and R4 is acknowledged in the
	README's "Honest limits" section; it reflects the scenario
	generator's default distribution rather than a hidden preference.
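
As a quick arithmetic check on the distribution above, the shares per R-level can be computed from the stated counts. Only the counts come from this document; the variable names are illustrative.

```python
# Trace counts per R-level, as stated in the text above.
trace_counts = {"R1": 22, "R2": 23, "R3": 3, "R4": 7, "R5": 23}

total = sum(trace_counts.values())
shares = {level: count / total for level, count in trace_counts.items()}

print(f"total traces: {total}")
for level, share in shares.items():
    print(f"{level}: {trace_counts[level]:>2} traces ({share:.1%})")
```

R3 and R4 together contribute 10 of the 78 traces, roughly 13 % of the warmup set.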

	---

	## 3. Format-coverage gate

	Between SFT and GRPO we run a gate on 20 held-out prompts: the
	model generates a completion for each, and the gate passes only
	if both `<action/>` and `<reversibility/>` tags are present in
	at least 80 % of completions (16 of 20).
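
The gate logic can be sketched in a few lines. This is an illustration rather than the repository's implementation: the regular expressions assume that any opening or self-closing `<action ...>` and `<reversibility ...>` tag counts as present, and the function names are hypothetical.

```python
import re

# Assumed tag shapes: any opening or self-closing tag with these names.
ACTION_TAG = re.compile(r"<action\b[^>]*>")
REVERSIBILITY_TAG = re.compile(r"<reversibility\b[^>]*>")


def format_coverage(completions: list[str]) -> float:
    """Fraction of completions that contain both required tags."""
    if not completions:
        raise ValueError("need at least one completion to evaluate the gate")
    well_formed = sum(
        1
        for text in completions
        if ACTION_TAG.search(text) and REVERSIBILITY_TAG.search(text)
    )
    return well_formed / len(completions)


def passes_gate(completions: list[str], threshold: float = 0.80) -> bool:
    """True when coverage reaches the threshold (80 % in this document)."""
    return format_coverage(completions) >= threshold


# Made-up completions: 16 well-formed and 4 missing the reversibility tag.
good = '<action id="fs_ls"/> <reversibility level="R1"/>'
bad = '<action id="fs_ls"/> this one forgot the second tag'
print(passes_gate([good] * 16 + [bad] * 4))
print(passes_gate([good] * 15 + [bad] * 5))
```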

	The gate exists because we saw two early pipeline failures in
	which SFT converged to low loss but still emitted malformed tags
	at generation time, because the tag format collided with the
	model's instruction-tuning prior.
	Running the full GRPO stage on a malformed policy would burn ~60
	minutes of GPU time for no useful signal. The gate catches this
	in ~1 minute.

	---

	## 4. GRPO configuration

	We use TRL's `GRPOTrainer` under Unsloth 4-bit quantisation with
	LoRA rank 16. Settings worth explaining:

	| Parameter | Value | Reason |
	|---|---|---|
	| `group_size` | 4 | Per-prompt rollout diversity; enough for the relative-advantage calculation to have non-zero variance on most prompts |
	| `num_iterations` (μ) | 2 | Two inner policy-gradient updates per generation batch. Trades a small amount of off-policy drift for faster convergence |
	| `beta` (KL coefficient) | 0.04 | The TRL default at the time of our runs (check your installed version, as the default has changed across releases). Higher β-values keep the policy closer to the SFT reference, which prevents a late-training "forgetting" failure mode where the policy loses previously-correct predictions as the curriculum phases in harder tasks |
	| `temperature` | 0.85 | High enough that rollouts within a group differ meaningfully, so the group-relative advantage has a useful gradient |
	| `total_episodes` | 300 prompts | 300 × 4 = 1 200 rollouts on a T4 in ~70 min |
	| `max_completion_length` | 280 | Our completions are three short tags; longer budgets invite length-drift without improving signal |
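
For readers porting the recipe, the table can be captured as a small settings object. This is a sketch only: the class and property names are hypothetical, the field names mirror the table, and the mapping onto the trainer is left out. Note that TRL's `GRPOConfig` names the per-prompt group size `num_generations`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GrpoSettings:
    """Values copied from the table above; names follow the table, not TRL."""

    group_size: int = 4               # rollouts sampled per prompt
    num_iterations: int = 2           # inner updates per generation batch (mu)
    beta: float = 0.04                # KL coefficient
    temperature: float = 0.85         # sampling temperature for rollouts
    total_episodes: int = 300         # number of prompts
    max_completion_length: int = 280  # generation budget per completion

    @property
    def total_rollouts(self) -> int:
        """Prompts times rollouts per prompt, the 300 x 4 figure in the table."""
        return self.total_episodes * self.group_size


settings = GrpoSettings()
print(settings.total_rollouts)
```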

	### 4.1 On reward shaping

	We deliberately do not shape the environment reward beyond a
	dynamic weighting that phases the format reward out between
	episodes 60 and 150. Every other signal the policy sees during
	GRPO comes from the same four-component rubric it will be
	evaluated on.
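
One way to picture the phase-out is a weight that decays between the two episode boundaries. The sketch below assumes a linear decay from an initial weight of 1.0; the decay shape, the initial weight, and the function name are all assumptions, since the text above only fixes the boundaries at episodes 60 and 150.

```python
def format_reward_weight(
    episode: int,
    start: int = 60,
    end: int = 150,
    initial_weight: float = 1.0,
) -> float:
    """Weight applied to the format reward at a given episode.

    Full weight up to `start`, zero from `end` onward, and a linear ramp
    in between (the linear shape is an assumption of this sketch).
    """
    if episode <= start:
        return initial_weight
    if episode >= end:
        return 0.0
    remaining = (end - episode) / (end - start)
    return initial_weight * remaining


for episode in (0, 60, 105, 150, 299):
    print(episode, format_reward_weight(episode))
```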

	We considered an "unlikeliness" shaping term (reward rare correct
	solutions more) but removed it after observing that the technique
	is designed for binary-verifier tasks like theorem proving. In a
	continuous-reward classification task like ours, where
	partial credit means the top-ranked reward sample is usually the
	correct one, the shaping penalises correctness. The clearest
	diagnostic was a single metric from a pilot run:

	```
	db_snapshot (actual R-level R2):
	predicted R1 → avg shaped reward 0.773
	predicted R2 → avg shaped reward 0.751
	```

	The shaping inverted the gradient. Disabling it restored the
	expected ordering
	(`correct R2 > incorrect R1`), which we verified by a quick sanity
	check over 4 sample rollouts before committing to the change. The
	general principle — match the training signal to the evaluation
	signal, don't add gradient pressure you will not measure — is the
	methodological guidance we ship here.
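
The ordering check described above is simple enough to automate. The sketch below is hypothetical (the function name and input shape are not from the repository) and uses only the two pilot averages quoted in this section.

```python
def correct_prediction_ranks_first(
    mean_reward_by_prediction: dict[str, float],
    actual_level: str,
) -> bool:
    """True when the correct R-level has a strictly higher mean reward than every other prediction."""
    if actual_level not in mean_reward_by_prediction:
        raise KeyError(f"no rollouts predicted the actual level {actual_level!r}")
    correct_reward = mean_reward_by_prediction[actual_level]
    return all(
        correct_reward > reward
        for level, reward in mean_reward_by_prediction.items()
        if level != actual_level
    )


# Pilot-run averages for db_snapshot (actual level R2) under the shaped reward.
shaped = {"R1": 0.773, "R2": 0.751}
print(correct_prediction_ranks_first(shaped, actual_level="R2"))
```

With the shaped averages the check fails, which is the inversion reported above.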

	### 4.2 Length monitor

	Independently of the reward architecture, the pipeline tracks the
	rolling-window mean completion length. If it exceeds 1 000
	characters for three consecutive windows, the callback aborts
	training with a clean error. This caught two early failure modes
	in which the policy drifted into verbose explanation blocks (3×
	completion length, −50 % throughput). The format rubric penalises
	such blocks, but not enough to outweigh the GRPO advantage from
	the occasional correct solution in the long tail. The monitor
	aborts those runs cleanly instead of letting them burn the full
	GPU budget.
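
The abort rule can be sketched independently of any trainer. This is an illustration under stated assumptions: windows are treated as consecutive, non-overlapping blocks, the window size of 20 completions is hypothetical, and the class and exception names are invented here. The real pipeline runs this as a training callback, which is not modelled.

```python
class LengthDriftError(RuntimeError):
    """Raised when mean completion length stays above the limit for too long."""


class LengthMonitor:
    def __init__(self, window_size: int = 20, max_mean_chars: int = 1000, patience: int = 3):
        self.window_size = window_size        # completions per window (assumed value)
        self.max_mean_chars = max_mean_chars  # 1 000 characters, from the text above
        self.patience = patience              # three consecutive windows, from the text above
        self._current: list[int] = []
        self._consecutive = 0

    def record(self, completion: str) -> None:
        """Add one completion; raise once `patience` windows in a row exceed the limit."""
        self._current.append(len(completion))
        if len(self._current) < self.window_size:
            return
        mean_length = sum(self._current) / len(self._current)
        self._current.clear()
        # A window at or below the limit resets the streak.
        self._consecutive = self._consecutive + 1 if mean_length > self.max_mean_chars else 0
        if self._consecutive >= self.patience:
            raise LengthDriftError(
                f"mean completion length {mean_length:.0f} chars exceeded "
                f"{self.max_mean_chars} for {self.patience} consecutive windows"
            )


monitor = LengthMonitor(window_size=2)
try:
    for _ in range(6):  # three full windows of over-long completions
        monitor.record("x" * 1500)
except LengthDriftError as error:
    print(f"aborting: {error}")
```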

	---

	## 5. Curriculum

	The task sampler follows a three-phase curriculum:

	| Episodes | Composition |
	|---|---|
	| 0 – 49 | Standard tasks only. The policy establishes a baseline on the familiar distribution. |
	| 50 – 149 | 50 % destructive-outcome variants. The policy is exposed to the tasks where the normally-safe action is unavailable. |
	| 150 – 299 | 70 % destructive-outcome variants. The policy is pushed to solve the hard distribution. |
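
The schedule in the table maps directly onto a step function. The sketch below is illustrative: the function names are hypothetical, the sampler draws each episode independently, and episodes past 299 are assumed to stay in the final phase.

```python
import random


def destructive_fraction(episode: int) -> float:
    """Share of destructive-outcome variants for a given episode index."""
    if episode < 0:
        raise ValueError("episode index must be non-negative")
    if episode < 50:
        return 0.0   # episodes 0-49: standard tasks only
    if episode < 150:
        return 0.5   # episodes 50-149
    return 0.7       # episodes 150-299 (and, by assumption, anything later)


def sample_variant(episode: int, rng: random.Random) -> str:
    """Draw 'destructive' or 'standard' according to the phase of the curriculum."""
    return "destructive" if rng.random() < destructive_fraction(episode) else "standard"


rng = random.Random(0)
for episode in (0, 49, 50, 149, 150, 299):
    print(episode, destructive_fraction(episode), sample_variant(episode, rng))
```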

	Starting with destructive-only scenarios from episode 0 produces
	a cold-start problem: the policy fails every rollout, so every
	reward in a group is identical, the group-relative advantage is
	zero, and GRPO has nothing to learn from. Phasing them in after
	the warmup baseline is established avoids the cold start without
	sacrificing the final capability.
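
The cold-start argument follows from how a group-relative advantage is computed. The sketch below uses the usual mean and standard-deviation normalisation over one group of rollouts; the choice of population standard deviation and the `eps` value are assumptions of this sketch, and trainer implementations differ in those details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalise each reward against the mean and spread of its own group."""
    if len(rewards) < 2:
        raise ValueError("a group needs at least two rollouts")
    center = mean(rewards)
    spread = pstdev(rewards)
    return [(reward - center) / (spread + eps) for reward in rewards]


# Cold start: all four rollouts fail, so no rollout stands out from the group.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
# One rollout succeeds: it receives a positive advantage, the others negative.
print(group_relative_advantages([0.0, 0.0, 1.0, 0.0]))
```

When every reward in the group is equal, every advantage is zero and the policy-gradient term vanishes for that prompt.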

	---

	## 6. Evaluation protocol

	The held-out evaluation runs on seeds that are disjoint from both
	the training seeds and the warmup-trace seeds. Three
	policies are compared on identical seeds:

	1. Scripted baseline. A regex-driven heuristic that picks a
	safe read-only action (`fs_ls`, `db_select`, `git_log`) if one
	is available in the prompt, else `draft_internal_memo`. No
	model inference. Establishes the floor.
	2. Supervised-warmup only. The SFT adapter loaded standalone.
	Measures what the warmup alone achieves.
	3. RL-trained. The final GRPO adapter. Measures the uplift
	from the RL stage.
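
The scripted baseline in item 1 can be approximated as follows. This is a sketch, not the repository's heuristic: the priority order among the three read-only actions, the prompt format, and the other action names in the example prompts are assumptions.

```python
import re

# Read-only actions and fallback named in item 1 above.
SAFE_READ_ONLY_ACTIONS = ("fs_ls", "db_select", "git_log")
FALLBACK_ACTION = "draft_internal_memo"


def scripted_baseline(prompt: str) -> str:
    """Pick the first safe read-only action mentioned in the prompt, else the fallback."""
    for action_id in SAFE_READ_ONLY_ACTIONS:
        # Word boundaries keep 'fs_ls' from matching inside a longer identifier.
        if re.search(rf"\b{re.escape(action_id)}\b", prompt):
            return action_id
    return FALLBACK_ACTION


# Hypothetical prompts; the action lists are made up for illustration.
print(scripted_baseline("Available actions: fs_rm, db_select, git_push"))
print(scripted_baseline("Available actions: fs_rm, db_drop"))
```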

	The eval has two tracks:

	- Standard track: 24 scenarios across the four primary tasks,
	each sampled from the standard (non-destructive-only)
	distribution.
	- Destructive-only track: 12 scenarios across the four
	destructive-outcome variants, with seeds pre-verified to
	resolve to R5.

	All three policies see the same prompts and the same seeds. The
	reported numbers come from the standard track unless otherwise
	noted; the destructive-only track's role is to populate the R5
	row of the confusion matrix so R5 recall is actually measured.
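
The reason for the second track shows up when recall is computed per R-level: without any scenario whose actual level is R5, R5 recall is undefined. The sketch below uses made-up predictions purely to illustrate the calculation; they are not results from this project, and the function name is hypothetical.

```python
from typing import Optional


def recall(pairs: list[tuple[str, str]], level: str) -> Optional[float]:
    """Recall for one R-level from (actual, predicted) pairs; None if the level never occurs."""
    predictions_for_level = [predicted for actual, predicted in pairs if actual == level]
    if not predictions_for_level:
        return None  # no row in the confusion matrix, so recall is undefined
    hits = sum(predicted == level for predicted in predictions_for_level)
    return hits / len(predictions_for_level)


# Illustrative data only: a track with no R5 scenarios...
standard_track = [("R1", "R1"), ("R2", "R1"), ("R3", "R3"), ("R2", "R2")]
# ...and a 12-scenario track in which every actual level is R5.
destructive_track = [("R5", "R5")] * 9 + [("R5", "R4")] * 3

print(recall(standard_track, "R5"))
print(recall(destructive_track, "R5"))
```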

	---

	## 7. Reproducibility

	Every choice that affects the final numbers is pinned:

	- `pyproject.toml` pins Python dependencies.
	- `training/config.yaml` pins hyperparameters with the values we
	ran.
	- `training/generate_warmup_traces.py` regenerates the 78 traces
	deterministically from the env (given a fixed scenario
	generator; see §2 on cross-process caveats).
	- `tests/` catches regressions in both the env and the training
	glue code before they reach the GPU.
	- `tools/validate_submission.py` runs 94 compliance checks
	(OpenEnv API shape, file presence, endpoint availability,
	package metadata) and passes clean.

	The Colab quickstart (`notebooks/train_grpo_colab.ipynb`) lets a
	reviewer re-run the full pipeline on a T4 in ~80 minutes, or pull
	the pre-trained adapter from the artifacts dataset in seconds.