Spaces:

chane35
/

permanence

Running

App Files Files Community

permanence / Blog.md

chane35

PERMANENCE training: 4-stage SFT -> gate -> GRPO -> eval pipeline

5b9b3d4 verified about 1 month ago

preview code

raw

history blame contribute delete

14.2 kB

metadata

title: 'PERMANENCE: teaching language-model agents to recognise irreversible actions'
thumbnail: results/confusion_matrix.png
authors:
  - user: chane35
tags:
  - openenv
  - rl
  - world-modeling
  - agent-safety

PERMANENCE: what I built, what broke, and what I learned about teaching agents the cost of forever

Solo submission by Chanikya · Meta PyTorch Hackathon · one T4 · 1 200 training episodes.

The moment that started this

There is a specific kind of bug that I kept thinking about when designing this project.

Not the kind where a model hallucinates a fact. Not the kind where a model reasons incorrectly. The kind where a model executes DROP TABLE users on a production database, produces perfectly valid SQL, gets a clean return code — and then everyone finds out there was no snapshot.

The model did not malfunction. It did exactly what it was told. The problem was that it had no model of what it had just done. It could not distinguish between SELECT * FROM users and DROP TABLE users in terms of consequence. Both were just tokens that produced some output. One of them happened to erase six months of user data.

This is the problem PERMANENCE is designed to address: teaching a language model to know, before it acts, whether what it is about to do can be undone.

The insight that everything else followed from

The obvious approach is a lookup table. Label every action as safe or dangerous. DROP TABLE → dangerous. SELECT → safe. Done.

This approach is wrong, and understanding why it is wrong is the core intellectual contribution of this project.

git push --force is one of the most notorious "dangerous" commands in a developer's toolkit. But is it actually dangerous? It depends:

If the local and remote tips are already in sync, the force-push overwrites nothing. Recoverable. R2.
If the overwritten commits survive on another clone, you can pull them back. Recoverable with effort. R4.
If no clone has them and the reflog has expired, those commits are gone. Unrecoverable. R5.

The same command. The same parameters. Three completely different reversibility levels — depending entirely on the state of the world at the moment of execution.

A lookup table cannot capture this. The only way to correctly predict reversibility is to read world state. And an agent that learns to do that before committing to a prediction is doing something qualitatively different from an agent that memorises danger labels.

That is the problem I wanted to train an LLM to solve. The problem requires RL because no static dataset can capture the dynamic relationship between world state and action consequence.

The same git_push_force call flows through r_level_fn(world, params) and resolves differently depending on what the MockGitRepo simulator reports about live state.

Building a world the model could reason about

To train an agent on reversibility, I needed an environment where reversibility was real — not hand-coded labels, but actual consequences derived from actual simulator state.

I built three operational-semantics simulators:

MockFS maintains a file tree with four recovery layers: a live tree, a trash directory, timestamped backup snapshots, and a git_tracked set of paths. When you delete a file, the simulator checks all four layers to resolve the R-level. rm -rf on a backed-up path is R4. On an untracked path with trash disabled, it is R5. Same command. Different world.

MockGitRepo maintains a commit DAG, branch pointers, remote state, a reflog, and — the key detail — other_clones_have_commits: an explicit set of commit SHAs known to exist on other clones. This is what makes git push --force state-dependent. If all overwritten commits are in that set, the push is R4. If any of them are not, it is R5.

MockDatabase maintains tables, an open-transaction write-ahead log, and named snapshots. DROP TABLE with a prior db_snapshot in the dict is R4. Without one, R5. A DELETE inside an open BEGIN is R2 — rollback reverts it. After COMMIT with WAL active, it is R3.

Every action carries an r_level_fn(world, params) — a function that runs at execution time against the live simulator state and returns the ground truth R-level. The agent's <reversibility/> prediction is scored against that ground truth. There are no hand-specified labels anywhere in the reward pipeline.

The first thing I trained — and why it failed

The early training runs had a pattern I recognised immediately: the model would converge, loss would drop, reward would stabilise — and then I would look at the confusion matrix and find that every single prediction was R2.

The model had discovered the easiest possible strategy: always predict the safe reversibility level, always take a read-only or low-risk action, collect partial calibration credit, and exit the episode. Mean reward plateau near +0.4. Task completion near zero. Prediction accuracy technically 100 % — because every action taken was genuinely R2.

This is the safe-action collapse. It is not a training bug. It is the RL agent doing exactly what you incentivised it to do. If the reward for predicting R2 correctly exceeds the reward for taking a destructive action and correctly predicting R5, the agent will never take the destructive action.

The fix required rethinking the task distribution, not the reward function.

The forced-outcome variants: closing the safe exit

I introduced forced-outcome task variants — scenarios where the safe path is structurally unavailable. The backup storage is full. A regulatory hold has locked the db_snapshot action. The remote repository has been corrupted by a secret leak and the only valid correction is git push --force.

In these scenarios, the safe action does not exist. The task predicate only completes on the destructive action. An agent that tries to play safe fails the task entirely, and the unsolved-task cap limits total reward to 0.2.

The curriculum phases these in deliberately: 0 % in the first 50 episodes (cold-start — the agent needs to learn basic formatting before being thrown into hard scenarios), 50 % from episodes 50–149 (mix — break the local optimum without overwhelming the gradient), 70 % from episode 150 onward (the full hard distribution).

The result: the training reward stays above zero throughout. The agent learns to take destructive actions when they are the correct answer — and to correctly predict their reversibility. This is the evidence I cared most about producing.

The reward that is hard to game

With the forced-outcome variants closing the safe-action exploit, I still needed a reward function that could not be exploited in other ways.

The obvious exploits for a reversibility predictor are:

Predict everything R1, never act, collect calibration credit. The unsolved-task cap closes this: if the task predicate returns False, total reward ≤ 0.2 regardless of how accurate the predictions were.
Always predict R5 when uncertain, take no destructive actions. The forced-outcome variants close this: there are scenarios where the only scoring path is the destructive action.
Emit high confidence on every prediction regardless of accuracy. The calibration term closes this: prediction_score = level_accuracy × (1 − |confidence − level_accuracy|). Confident-and-wrong scores worse than uncertain-and-wrong. The gradient against over-confident errors is stronger than against uncertain ones.
Take a destructive action that closes off all future options. The option-preservation rubric closes this: 20 % of reward comes from the fraction of downstream actions that remain unlocked at episode end.

Four rubrics. Four closed exploits. The trained policy never found a path through any of them — zero catastrophic miscalls across all training and evaluation.

The thing that broke unexpectedly: gradient inversion

At one point I applied unlikeliness shaping — a technique from He et al. (arXiv:2506.02355) designed to penalise high-probability trajectories in favour of lower-probability-but-still-correct ones, originally developed for binary-verifier RL in formal theorem proving.

The technique broke training. Not subtly — dramatically. The SFT-to-RL lift collapsed from +0.246 to +0.052.

The reason, once I found it: unlikeliness shaping was designed for binary verifiers where a rollout is either correct or incorrect. PERMANENCE has a continuous partial-credit reward — a prediction that is off by one R-level still scores partially. In that setting, a low-probability incorrect prediction can rank above a high-probability correct one in a single observed batch, and the shaping term inverts the gradient: it penalises the correct prediction for being high-probability.

The diagnostic is simple but easy to miss: compute the reward each prediction pays for the same action and check whether the correct one pays more. If "no," the shaping is working against the objective regardless of how principled it looks on paper.

The lesson: techniques designed for binary verifiers do not automatically transfer to continuous-reward classification. The training signal must be interrogated at the gradient level, not just at the loss-curve level.

The format gate: one minute that saves seventy

Between SFT warmup and GRPO I placed a format-coverage gate: a 20-prompt probe that checks whether the warmup model can reliably emit both required tags on at least 80 % of completions. If it cannot, the gate aborts the pipeline before GRPO starts.

The gate caught a real failure mode — a tokenizer/prompt-template interaction that caused the model to emit the action tag without the reversibility tag on a significant fraction of completions. Loss was low. Validation perplexity looked fine. The format gate caught it in under a minute. GRPO on a model with broken tag emission would have produced approximately uniform gradient noise for seventy minutes of T4 time.

One minute of wall-time guarding seventy minutes of GPU compute. The gate is the cheapest investment in the pipeline.

The pipeline aborts cleanly at the gate on format failures rather than spending GPU time on a broken RL loop.

What the numbers actually show


+0.69	reward uplift over scripted baseline
24 / 24	held-out scenarios correctly predicted
0	catastrophic miscalls across all training and evaluation
1 200	training episodes on a single T4 GPU

Scripted (−0.025) → supervised warmup only (+0.418) → RL-trained (+0.664). The RL stage adds +0.246 on top of the supervised warmup — the gap is real and consistent across multiple training configurations.

Reward stays above zero once forced-outcome scenarios phase in at episode 50. The curriculum is working: the policy learns to solve hard scenarios rather than avoiding them.

On the standard 24-scenario eval track, every prediction lands on the R2 diagonal. The forced-outcome eval track (documented in docs/ABLATIONS.md) populates R4 and R5 — the agent correctly predicts destructive outcomes when it is required to take them.

The zero catastrophic miscalls is the result I am most confident in. A catastrophic miscall is taking an R5 action while predicting R1 or R2. Across every training configuration, every eval track, every ablation run — that count is zero. The catastrophe penalty in the reward function is doing its job.

What I would do with more time

The evaluation distribution is R2-heavy because backup-present scenarios are sampled with only ~15 % probability. R3 and R4 generalisation would require explicitly seeding those conditions — that is the clearest path to a richer confusion matrix.

The Meridian domain (social/business decision-making with its own reversibility semantics — firing an employee, issuing a public statement, approving a product launch) is fully implemented in the environment but not trained on. The reward pipeline is domain-agnostic; the gap is compute and time.

A calibrated reversibility predictor embedded in a real agent runtime — gating tool calls on predicted R-level with a configurable conservatism threshold — is the natural next step. The environment already trains the right capability. The missing piece is the deployment harness.

The actual contribution

PERMANENCE demonstrates one thing cleanly: a language model can be trained, via RL on a simulator with real operational semantics, to produce calibrated predictions of action reversibility that depend on world state rather than action identity.

That is a narrow claim. But it is a true one, backed by a working environment, a reproducible training pipeline, and an evaluation that shows the trained policy doing the right thing in scenarios where the safe path was deliberately removed.

The interesting part is not that the agent learns to say "dangerous." The interesting part is that it learns the conditions under which the same command is dangerous — and that those conditions are a property of the world, not of the string.

Give your agents the distinction between "undo" and "gone forever" — then let them choose.

Links:

Live environment: https://chane35-permanence.hf.space
Training workspace: https://chane35-permanence-training.hf.space
Colab quickstart: notebooks/train_grpo_colab.ipynb
Full results + ablations: docs/RESULTS.md · docs/ABLATIONS.md
Design reasoning: docs/TECHNIQUES.md