---
title: 'PERMANENCE: teaching language-model agents to recognise irreversible actions'
thumbnail: results/confusion_matrix.png
authors:
  - user: chane35
tags:
  - openenv
  - rl
  - world-modeling
  - agent-safety
---

PERMANENCE: teaching language-model agents to recognise irreversible actions

The most expensive bugs in agentic LLM deployments are not hallucinations. They are well-formed, syntactically correct, confidently executed actions against production state that cannot be undone. rm -rf on the wrong directory. git push --force over a teammate's commit. DROP TABLE with no snapshot. The model is not confused about what these commands do; it just never learned that some commands, in some states, leave no way back.

PERMANENCE is an OpenEnv environment and training recipe that treats this capability gap as the objective, not as a symptom.

The claim

A language model trained with PERMANENCE can, before executing an action against a filesystem, git repo, or database, produce a calibrated prediction of how reversible that action is given the current state of the world. The qualifier "given the current state of the world" is load-bearing, and it is the central reason this is an RL problem.

Prediction accuracy of the RL-trained policy over 34 valid held-out scenarios. Every R2 action is correctly predicted R2; every R5 action is correctly predicted R5. There are zero catastrophic miscalls across the full evaluation and all 1,200 training episodes.
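To make the accuracy claim concrete, here is a minimal sketch of how per-level accuracy and a catastrophic-miscall count could be tallied from (true, predicted) level pairs. It is not the repo's evaluation code. The R1 to R5 ordinal scale, the function names, and the definition of a catastrophic miscall (an irreversible action predicted as anything less severe) are assumptions for illustration, and the labels are made up.

```python
from collections import Counter

# Assumed ordinal scale; the post itself only names R2 and R5.
LEVELS = ["R1", "R2", "R3", "R4", "R5"]


def confusion_counts(true_levels, predicted_levels):
    """Count how often each (true, predicted) pair of levels occurs."""
    return Counter(zip(true_levels, predicted_levels))


def per_level_accuracy(true_levels, predicted_levels):
    """Fraction of correct predictions for each true level that appears."""
    counts = confusion_counts(true_levels, predicted_levels)
    accuracy = {}
    for level in LEVELS:
        total = sum(n for (t, _), n in counts.items() if t == level)
        if total:
            accuracy[level] = counts[(level, level)] / total
    return accuracy


def catastrophic_miscalls(true_levels, predicted_levels, irreversible="R5"):
    """Hypothetical definition: an irreversible action predicted as any other level."""
    return sum(
        1
        for t, p in zip(true_levels, predicted_levels)
        if t == irreversible and p != irreversible
    )


# Made-up labels for illustration only; these are not the post's evaluation data.
true_levels = ["R2", "R2", "R5", "R5", "R5"]
predicted_levels = ["R2", "R2", "R5", "R5", "R5"]

print(per_level_accuracy(true_levels, predicted_levels))
print(catastrophic_miscalls(true_levels, predicted_levels))
```

Counting miscalls separately from accuracy reflects the asymmetry in the claim: calling a recoverable action irreversible costs caution, while the reverse costs data.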

The environment

The environment is built around three operational-semantics simulators:

MockFS for filesystem actions and recovery layers such as trash and backups
MockGitRepo for git actions, reflog-style recovery, and clone-based recovery
MockDatabase for database actions, snapshots, WAL, and transaction-based rollback

The same action id can land at different reversibility levels depending on the world state. That means the model has to read context, not just memorise that one command is dangerous.
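A toy illustration of that state dependence, assuming a filesystem whose only recovery layers are a trash and a backup. The `FsState` class, the function name, and the specific level assigned to each branch are hypothetical and are not the MockFS API; the R-level scale (higher means harder to undo) is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FsState:
    """Hypothetical recovery state of a toy filesystem (not the MockFS API)."""
    trash_enabled: bool
    has_backup: bool


def reversibility_of_delete(state: FsState) -> str:
    """Map the same 'delete' action to a level using only the recovery state.

    The levels chosen for each branch are illustrative assumptions.
    """
    if state.trash_enabled:
        return "R2"  # file can be restored from trash
    if state.has_backup:
        return "R3"  # recoverable, but only from an older backup
    return "R5"      # no recovery layer left: irreversible


# Same action id, three different world states.
for state in (
    FsState(trash_enabled=True, has_backup=False),
    FsState(trash_enabled=False, has_backup=True),
    FsState(trash_enabled=False, has_backup=False),
):
    print(state, "->", reversibility_of_delete(state))
```

A policy that keys only on the action id would give one answer for all three states, which is exactly the shortcut the environment is meant to rule out.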

The training signal

We train with a four-part OpenEnv rubric:

task completion
prediction calibration
option preservation
catastrophe avoidance

That reward is the same signal used at evaluation time, so the model is trained on the behaviour it is actually judged on. The training recipe uses supervised warmup, a format gate, GRPO, and held-out evaluation.
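One way to picture a four-part rubric is as a weighted sum of component scores. The sketch below assumes each component is scored in [0, 1] and uses equal weights; the post does not state the actual weights, score ranges, or combination rule, so `RubricScores`, `WEIGHTS`, and `rubric_reward` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RubricScores:
    """One score per rubric component, each assumed to lie in [0, 1]."""
    task_completion: float
    prediction_calibration: float
    option_preservation: float
    catastrophe_avoidance: float


# Hypothetical equal weights; the real rubric may weight components differently.
WEIGHTS = {
    "task_completion": 0.25,
    "prediction_calibration": 0.25,
    "option_preservation": 0.25,
    "catastrophe_avoidance": 0.25,
}


def rubric_reward(scores: RubricScores, weights=WEIGHTS) -> float:
    """Combine the four component scores into a single scalar reward."""
    for name, value in vars(scores).items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(weights[name] * getattr(scores, name) for name in weights)


# An episode that finishes the task but triggers a catastrophe.
print(rubric_reward(RubricScores(1.0, 0.5, 1.0, 0.0)))
```

Because the same function can score both training rollouts and held-out episodes, there is no gap between what is optimised and what is reported.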

Scripted baseline, supervised warmup, and RL-trained policy on the same held-out seeds.

Per-episode reward during policy optimisation, with the curriculum phasing in harder scenarios over time.
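For readers unfamiliar with GRPO, its core step is turning the rewards of a group of rollouts sampled for the same prompt into group-relative advantages by normalising against the group's mean and standard deviation. The sketch below shows only that step. It is not this repo's training code; the function name and the choice of population standard deviation are assumptions, and implementations differ on that detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against its own group's mean and spread.

    `rewards` holds the per-episode rewards of several rollouts sampled for
    the same scenario. `eps` avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four rollouts of one scenario.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(advantages)
```

Rollouts that beat their group's average get a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.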

Why this matters

This is a small but concrete example of an RL environment for agent safety and operational reasoning. The interesting part is not that the agent learns to say "dangerous". The interesting part is that it learns to condition that judgment on recovery state, because the same command can be reversible in one state and irreversible in another.

Honest limits

The evaluation distribution is intentionally narrow. It is strongest on the R2 and R5 cases that the scenario generator surfaces most often, and it does not claim broad generalisation across unrelated domains. That is the point of the writeup: show what was trained, what was measured, and where the evidence stops.

Reproduce

python training/generate_warmup_traces.py
python -m training.pipeline --config training/config.yaml
python tools/render_results.py

The repo also includes a Colab quickstart in notebooks/train_grpo_colab.ipynb.