	# Ablation summary

	We report five configurations. The headline results in the README come from
	Run A; the other runs isolate specific design choices so a reviewer can
	attribute each improvement to its cause.

	All numbers come from held-out evaluation runs whose raw artefacts are
	stored under `training_runs/` on the training machine. Reward values are
	mean reward over the 24-scenario standard tech track unless a forced-outcome
	track is present, in which case the forced-track scenarios are included in
	the scenario count. Numbers are reported to three decimal places from the
	`eval/results.json` of each run.

	---

	## Table of runs

	| Label | What it varied | SFT reward | RL reward | Lift | Eval acc (grpo_trained) | Confusion coverage (tech track, grpo_trained) | Catastrophes |
	|---|---|---|---|---|---|---|---|
	| A (headline) | Baseline pipeline: forced-variant curriculum during training, beta_rank=0.25 unlikeliness shaping, standard scenarios at eval | +0.418 | +0.664 | +0.246 | 100 % | R2 populated | 0 |
	| B | Run A adapter re-evaluated with a forced-outcome eval track added (no-safe-path variants at eval) | +0.406 | +0.628 | +0.222 | 70.8 % | R2, R4, R5 populated (broadest coverage) | 0 |
	| C | Run B with env precondition fix: destructive DB ops on missing tables now short-circuit rather than resolving to R1 silently | +0.414 | +0.591 | +0.176 | 75.0 % | R2, R5 populated | 0 |
	| D | Disable rank-based unlikeliness shaping (beta_rank=0.25 → 0.0) | +0.623 | +0.675 | +0.052 | 100 % | R2 populated | 0 |
	| E | Run D adapter, re-evaluated with a forced-outcome eval track | +0.623 | +0.675 | +0.052 | 100 % | R2, R5 populated | 0 |

	*Lift = RL reward − SFT reward. All catastrophe counts are zero across every
	configuration.*
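The lift column can be rechecked from the rounded table values. The snippet below is a small sanity check, not part of the pipeline; the dictionary layout and function name are illustrative. Because the inputs are already rounded to three decimals, a recomputed lift can differ from the reported one by one unit in the last place, which is what happens for Run C (0.177 recomputed vs 0.176 reported), plausibly because the reported lift was derived from unrounded rewards.

```python
# Values copied from the table of runs (already rounded to three decimals).
runs = {
    "A": {"sft": 0.418, "rl": 0.664, "reported_lift": 0.246},
    "B": {"sft": 0.406, "rl": 0.628, "reported_lift": 0.222},
    "C": {"sft": 0.414, "rl": 0.591, "reported_lift": 0.176},
    "D": {"sft": 0.623, "rl": 0.675, "reported_lift": 0.052},
    "E": {"sft": 0.623, "rl": 0.675, "reported_lift": 0.052},
}


def lift(sft_reward: float, rl_reward: float) -> float:
    """Lift = RL reward - SFT reward, rounded to three decimals."""
    return round(rl_reward - sft_reward, 3)


recomputed = {label: lift(r["sft"], r["rl"]) for label, r in runs.items()}

# Flag rows where the recomputed lift differs from the reported lift
# by more than half a unit in the third decimal place.
mismatches = {
    label: (recomputed[label], r["reported_lift"])
    for label, r in runs.items()
    if abs(recomputed[label] - r["reported_lift"]) > 5e-4
}

for label in runs:
    print(label, recomputed[label], runs[label]["reported_lift"])
print("rows differing after rounding:", sorted(mismatches))
```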

	---

	## What each row teaches

	A — headline. Largest RL-over-SFT lift observed (+0.246). Demonstrates
	that reinforcement learning contributes meaningfully above and beyond
	supervised warmup when the warmup policy has not already saturated the task
	distribution. The unlikeliness shaping (beta_rank=0.25) keeps the policy from
	collapsing to "always pick the safe R1 action" by penalising high-probability
	trajectories relative to lower-probability-but-still-correct ones.

	B. Added forced-outcome variants at eval time: scenarios where the safe
	path (snapshot, forward-commit) is locked and the destructive action is the
	only correct answer. The eval confusion matrix now populates R4 and R5,
	levels absent in standard-track evaluation. The modest drop in RL lift
	(+0.222 vs +0.246) reflects the harder eval distribution: scenarios that
	require correctly predicting and taking R5 actions are harder than
	scenarios where the safe path is available.
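To make "forced-outcome variant" concrete, here is a minimal sketch of how such a variant could be derived from a standard scenario. Everything in it is hypothetical: the `Scenario` class, its fields, the action names, and `forced_variant` are illustrative stand-ins, not the environment's actual types. The only details taken from the text are that the safe paths (snapshot, forward-commit) are locked and that the destructive action becomes the only correct answer.

```python
from dataclasses import dataclass, replace

# Safe paths named in the text; the identifiers themselves are assumptions.
SAFE_ACTIONS = frozenset({"snapshot", "forward_commit"})


@dataclass(frozen=True)
class Scenario:
    """Hypothetical scenario record, for illustration only."""
    name: str
    available_actions: frozenset
    correct_action: str


def forced_variant(scenario: Scenario, destructive_action: str) -> Scenario:
    """Lock the safe paths so the destructive action is the only correct answer."""
    remaining = scenario.available_actions - SAFE_ACTIONS
    if destructive_action not in remaining:
        raise ValueError(f"{destructive_action!r} is not available in {scenario.name!r}")
    return replace(
        scenario,
        name=f"{scenario.name}/forced",
        available_actions=remaining,
        correct_action=destructive_action,
    )


base = Scenario(
    name="schema_migration",
    available_actions=frozenset({"snapshot", "forward_commit", "drop_table"}),
    correct_action="snapshot",
)
forced = forced_variant(base, "drop_table")
print(forced.name, sorted(forced.available_actions), forced.correct_action)
```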

	C. Fixed an environment precondition so that destructive DB operations
	(DROP TABLE, TRUNCATE) on a table that does not exist short-circuit with a
	−0.1 penalty rather than executing the simulator no-op and returning R1.
	This is an eval-hygiene fix: accuracy is no longer inflated (rows where
	the model hardcodes the wrong table name are no longer counted as correct R1
	predictions), at the cost of a small reward dip on those rows.
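The precondition check can be sketched as follows. This is an illustration of the behaviour described above, not the environment's code: `FakeDB`, `StepResult`, and `apply_op` are hypothetical names, and only the −0.1 penalty and the two destructive operations come from the text.

```python
from dataclasses import dataclass, field

MISSING_TABLE_PENALTY = -0.1               # penalty value stated in the text
DESTRUCTIVE_OPS = {"DROP TABLE", "TRUNCATE"}


@dataclass
class FakeDB:
    """Hypothetical stand-in for the simulator's database state."""
    tables: set = field(default_factory=set)


@dataclass
class StepResult:
    executed: bool        # False when the operation was short-circuited
    reward_delta: float   # penalty applied by the precondition check, if any


def apply_op(db: FakeDB, op: str, table: str) -> StepResult:
    if op in DESTRUCTIVE_OPS and table not in db.tables:
        # Precondition failed: stop here instead of running a silent no-op
        # that the evaluator would otherwise score as a safe outcome.
        return StepResult(executed=False, reward_delta=MISSING_TABLE_PENALTY)
    if op == "DROP TABLE":
        db.tables.discard(table)
    # TRUNCATE keeps the table; row contents are not modelled in this sketch.
    return StepResult(executed=True, reward_delta=0.0)


db = FakeDB(tables={"users"})
typo = apply_op(db, "DROP TABLE", "usres")   # wrong table name: penalised
real = apply_op(db, "DROP TABLE", "users")   # existing table: executed
print(typo, real, db.tables)
```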

	D. Disabled the rank-based unlikeliness shaping (beta_rank=0.25 → 0.0).
	This is an empirical finding. The unlikeliness technique (He et al.,
	arXiv:2506.02355) was designed for binary-verifier RL tasks in formal theorem
	proving, whereas our task is continuous partial-credit classification. In
	one observed batch the shaping inverted the gradient signal: a low-reward
	incorrect prediction ranked above a high-reward correct one and earned a
	larger advantage weight. With shaping disabled, the SFT-to-RL delta
	collapsed to +0.052. Documenting the failure mode is the contribution; it
	motivates keeping unlikeliness shaping in the headline configuration (Run A)
	where the SFT policy has not yet saturated and the shaping keeps gradient
	variance alive.
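The inversion is easy to reproduce on a toy group. The sketch below assumes a rank-based weight of the form w_i = 1 − beta_rank · (G − rank_i) / G, with rank 0 for the most probable sample in a group of size G, followed by GRPO-style group normalisation. That weight form is our reading of the cited technique, and the rewards and log-probabilities are made-up illustrative values, not numbers from any run.

```python
import statistics


def unlikeliness_weights(logprobs, beta_rank):
    """Assumed form: w_i = 1 - beta_rank * (G - rank_i) / G, rank 0 = most probable."""
    G = len(logprobs)
    order = sorted(range(G), key=lambda i: logprobs[i], reverse=True)
    rank = {idx: r for r, idx in enumerate(order)}
    return [1.0 - beta_rank * (G - rank[i]) / G for i in range(G)]


def group_advantages(rewards):
    """Group-normalised advantages: (r - mean) / std, zero if all rewards are equal."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def shaped_advantages(rewards, logprobs, beta_rank):
    weights = unlikeliness_weights(logprobs, beta_rank)
    return group_advantages([r * w for r, w in zip(rewards, weights)])


# Sample 0: correct and likely. Sample 1: incorrect but unlikely.
logprobs = [-1.0, -3.0]

# Partial-credit rewards: the incorrect sample still earns 0.55.
partial = [0.60, 0.55]
plain = group_advantages(partial)                    # correct sample ahead
shaped = shaped_advantages(partial, logprobs, 0.25)  # incorrect sample ahead

# Binary-verifier rewards: the incorrect sample earns 0, so no inversion.
binary = shaped_advantages([1.0, 0.0], logprobs, 0.25)

print("plain :", plain)
print("shaped:", shaped)
print("binary:", binary)
```

With binary rewards an incorrect sample stays at zero whatever its weight, so the ordering is preserved. With partial credit, a slightly lower reward on an unlikely sample can overtake a higher reward on a likely one after weighting.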

	E. Same adapter as Run D, re-evaluated with the forced-outcome track
	added. Populates the R5 row of the confusion matrix and confirms that the
	trained policy predicts destructive R5 actions correctly when required to take
	them. Lift is identical to Run D (+0.052) because the adapter is unchanged.

	---

	## What this pattern tells a reviewer

	- RL adds measurable value above SFT. Run A shows +0.246 lift; even
	Run D (highest SFT reward, +0.623) shows +0.052. The direction is consistent.
	- Zero catastrophic miscalls across every configuration. The reward
	architecture's catastrophe penalty is effective regardless of which eval
	distribution or shaping variant is used.
	- Unlikeliness shaping matters at early SFT skill levels. When the
	warmup policy has not saturated (Runs A–C, SFT reward ~0.41), shaping
	preserves gradient variance and RL lifts strongly. When SFT is already
	at ~0.62 (Runs D–E, which run with shaping disabled), the ceiling is
	close and RL has little room to add lift.
	- Broadest confusion-matrix coverage comes from the forced-outcome eval
	track (Run B: R2, R4, R5; Runs C and E: R2, R5). The standard-only eval
	track (Runs A and D) resolves to R2 under almost all seeds.
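"Coverage" here means which rows of the confusion matrix contain at least one scenario. The sketch below shows one way to compute it. It assumes five levels R1 to R5 (the text names R1, R2, R4 and R5; R3 is inferred) and that rows index the true level, and the example pairs are invented for illustration rather than taken from any run.

```python
LEVELS = ["R1", "R2", "R3", "R4", "R5"]  # R3 is inferred, not named in the text


def confusion_matrix(pairs):
    """Build a matrix from (true_level, predicted_level) pairs; rows = true level."""
    index = {level: i for i, level in enumerate(LEVELS)}
    matrix = [[0] * len(LEVELS) for _ in LEVELS]
    for true_level, predicted_level in pairs:
        matrix[index[true_level]][index[predicted_level]] += 1
    return matrix


def populated_rows(matrix):
    """Levels whose row holds at least one evaluated scenario."""
    return [LEVELS[i] for i, row in enumerate(matrix) if sum(row) > 0]


def accuracy(matrix):
    """Fraction of scenarios on the diagonal (predicted level == true level)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(LEVELS)))
    return correct / total if total else 0.0


# Invented example: mostly R2 scenarios plus a few forced-outcome ones.
pairs = [("R2", "R2")] * 5 + [("R4", "R2"), ("R5", "R5"), ("R5", "R5")]
matrix = confusion_matrix(pairs)
print(populated_rows(matrix), accuracy(matrix))
```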

	---

	## Exact numbers from eval/results.json

	| Label | scripted reward | sft_only reward | grpo_trained reward | sft_only acc | grpo_trained acc | sft_only catastrophes | grpo_trained catastrophes |
	|---|---|---|---|---|---|---|---|
	| A | −0.025 | +0.418 | +0.664 | 100 % | 100 % | 0 | 0 |
	| B | −0.025 | +0.406 | +0.628 | 100 % | 70.8 % | 0 | 0 |
	| C | −0.025 | +0.414 | +0.591 | 100 % | 75.0 % | 0 | 0 |
	| D | −0.025 | +0.623 | +0.675 | 100 % | 100 % | 0 | 0 |
	| E | −0.025 | +0.623 | +0.675 | 100 % | 100 % | 0 | 0 |