PERMANENCE training: 4-stage SFT -> gate -> GRPO -> eval pipeline
Blog.md
ADDED
@@ -0,0 +1,80 @@
---
title: "PERMANENCE: teaching language-model agents to recognise irreversible actions"
thumbnail: results/confusion_matrix.png
authors:
- user: chane35
tags: [openenv, rl, world-modeling, agent-safety]
---

# PERMANENCE: teaching language-model agents to recognise irreversible actions

The most expensive bugs in agentic LLM deployments are not hallucinations. They are well-formed, syntactically correct, confidently executed actions against production state that cannot be undone. `rm -rf` the wrong directory. `git push --force` over a teammate's commit. `DROP TABLE` with no snapshot. The model is not confused about what these commands do; it just never learned that some commands, in some states, leave no way back.

**PERMANENCE** is an OpenEnv environment and training recipe that treats this capability gap as the objective, not as a symptom.

---

## The claim

A language model trained with PERMANENCE can, before executing an action against a filesystem, git repo, or database, produce a calibrated prediction of how reversible that action is **given the current state of the world**. "Given the current state of the world" is doing a lot of work here, and it is the central reason this is an RL problem.

![Confusion matrix: predicted vs actual reversibility level, 34 held-out scenarios](results/confusion_matrix.png)

*Prediction accuracy on the RL-trained policy over 34 valid held-out scenarios. Every R2 action is correctly predicted R2; every R5 action is correctly predicted R5. Zero catastrophic miscalls across the full evaluation and all 1 200 training episodes.*
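
As a rough illustration of how a confusion matrix and a miscall count like the ones in the caption could be tallied, here is a minimal sketch. It assumes that R5 is the least reversible level and that a "catastrophic miscall" means an R5 action predicted as anything else; the post does not spell out either definition, and the function names and the example data are hypothetical, not the reported results.

```python
from collections import Counter
from typing import Iterable, Tuple

# (actual_level, predicted_level), e.g. ("R5", "R5")
Pair = Tuple[str, str]


def confusion_counts(pairs: Iterable[Pair]) -> Counter:
    """Count how often each (actual, predicted) combination occurs."""
    return Counter(pairs)


def catastrophic_miscalls(pairs: Iterable[Pair], worst_level: str = "R5") -> int:
    """Count actions at the worst level that were predicted as something else.

    Assumption: this is what "catastrophic" means; the real rubric may differ.
    """
    return sum(
        1
        for actual, predicted in pairs
        if actual == worst_level and predicted != worst_level
    )


# Made-up illustrative data, not the evaluation set from the post.
example = [("R2", "R2"), ("R2", "R2"), ("R5", "R5")]
matrix = confusion_counts(example)
miscalls = catastrophic_miscalls(example)
```

Keeping the miscall count separate from overall accuracy makes the asymmetric failure mode visible even when the diagonal of the matrix looks healthy.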

---

## The environment

The environment is built around three operational-semantics simulators:

- `MockFS` for filesystem actions and recovery layers such as trash and backups
- `MockGitRepo` for git actions, reflog-style recovery, and clone-based recovery
- `MockDatabase` for database actions, snapshots, WAL, and transaction-based rollback

The same action id can land at different reversibility levels depending on the world state. That means the model has to read context, not just memorise that one command is dangerous.
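
To make the state dependence concrete, here is a minimal sketch of one delete action graded against different recovery states. `FsState`, `reversibility_of_delete`, and the three labels are hypothetical stand-ins chosen for illustration; they are not the `MockFS` API or the environment's actual R-levels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FsState:
    """Hypothetical snapshot of the recovery layers around one file."""

    trash_enabled: bool
    has_backup: bool


def reversibility_of_delete(state: FsState) -> str:
    """Grade the same 'delete file' action against the current state.

    The labels are illustrative, not the environment's R-levels.
    """
    if state.trash_enabled:
        # The file can simply be restored from the trash.
        return "easily_reversible"
    if state.has_backup:
        # Recovery is possible, but only by restoring from a backup.
        return "recoverable_with_effort"
    # No recovery layer is left, so the delete cannot be undone.
    return "irreversible"


with_trash = reversibility_of_delete(FsState(trash_enabled=True, has_backup=False))
backup_only = reversibility_of_delete(FsState(trash_enabled=False, has_backup=True))
no_recovery = reversibility_of_delete(FsState(trash_enabled=False, has_backup=False))
```

The action is identical in all three calls; only the state argument changes the answer, which is why a lookup table keyed on the command alone cannot solve the task.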

---

## The training signal

We train with a four-part OpenEnv rubric:

- task completion
- prediction calibration
- option preservation
- catastrophe avoidance

That reward is the same signal used at evaluation time, so the model is trained on the behaviour it is actually judged on. The training recipe uses supervised warmup, a format gate, GRPO, and held-out evaluation.
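
One way to picture a four-part rubric collapsing into a single scalar reward is a weighted average, sketched below. The equal weights, the `[0, 1]` score range, and the names `RubricScores` and `combined_reward` are assumptions made for illustration; the post does not state how the four components are actually combined.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RubricScores:
    """Per-episode scores for the four rubric components, assumed in [0, 1]."""

    task_completion: float
    prediction_calibration: float
    option_preservation: float
    catastrophe_avoidance: float


# Hypothetical equal weighting; the real rubric may weight components differently.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "task_completion": 0.25,
    "prediction_calibration": 0.25,
    "option_preservation": 0.25,
    "catastrophe_avoidance": 0.25,
}


def combined_reward(scores: RubricScores, weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted average of the four component scores."""
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * getattr(scores, name) for name in weights)
    return weighted_sum / total_weight


perfect = combined_reward(RubricScores(1.0, 1.0, 1.0, 1.0))
reckless = combined_reward(RubricScores(1.0, 0.0, 1.0, 0.0))
```

Because the same function would score both training rollouts and held-out episodes, there is no gap between what is optimised and what is reported.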

![Baseline comparison](results/baseline_comparison.png)

*Scripted baseline, supervised warmup, and RL-trained policy on the same held-out seeds.*

![Reward curve: 1 200 GRPO training episodes](results/reward_curve.png)

*Per-episode reward during policy optimisation, with the curriculum phasing in harder scenarios over time.*
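
The policy-optimisation stage uses GRPO, whose core idea is to score each sampled rollout relative to the other rollouts drawn for the same scenario. The sketch below shows only that group-relative normalisation step in plain Python. The function name, the use of the population standard deviation, and the `eps` value are assumptions; a real training loop would normally rely on a library implementation.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalise rewards within one group of rollouts for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so rollouts
    that beat their siblings get a positive signal and the rest a negative one.
    """
    group_mean = mean(rewards)
    group_std = pstdev(rewards)  # assumption: population std; some code uses sample std
    return [(r - group_mean) / (group_std + eps) for r in rewards]


# Two rollouts for one scenario: one failed, one succeeded.
split_group = group_relative_advantages([0.0, 1.0])

# All rollouts scored the same, so no rollout is preferred.
flat_group = group_relative_advantages([0.7, 0.7, 0.7])
```

When every rollout in a group earns the same reward the advantages are all zero, so such a group contributes no gradient signal.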

---

## Why this matters

This is a small but concrete example of an RL environment for agent safety and operational reasoning. The interesting part is not that the agent learns to say "dangerous". The interesting part is that it learns to condition that judgment on recovery state, because the same command can be reversible in one state and irreversible in another.

---

## Honest limits

The evaluation distribution is intentionally narrow. It is strongest on the R2 and R5 cases that the scenario generator surfaces most often, and it does not claim broad generalisation across unrelated domains. That is the point of the writeup: show what was trained, what was measured, and where the evidence stops.

---

## Reproduce

```bash
python training/generate_warmup_traces.py
python -m training.pipeline --config training/config.yaml
python tools/render_results.py
```

The repo also includes a Colab quickstart in `notebooks/train_grpo_colab.ipynb`.
README.md
CHANGED
@@ -20,7 +20,7 @@ tags:
 **Live environment**: https://chane35-permanence.hf.space
 **Training workspace**: https://chane35-permanence-training.hf.space
 **Artifacts**: https://huggingface.co/datasets/chane35/permanence-artifacts
-**Blog post**: [`
+**Blog post**: [`Blog.md`](Blog.md)
 **Architecture deep-dive**: [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md)
 **Training methods**: [`docs/METHODS.md`](docs/METHODS.md)
 **Full results**: [`docs/RESULTS.md`](docs/RESULTS.md)