Spaces: chane35/permanence

chane35 committed on Apr 26

Commit e5ed352 (verified) · 1 parent: f24e2c7

PERMANENCE training: 4-stage SFT -> gate -> GRPO -> eval pipeline

Files changed (2)

Blog.md +80 -0
README.md +1 -1

Blog.md ADDED

@@ -0,0 +1,80 @@

+---
+title: "PERMANENCE: teaching language-model agents to recognise irreversible actions"
+thumbnail: results/confusion_matrix.png
+authors:
+  - user: chane35
+tags: [openenv, rl, world-modeling, agent-safety]
+---
+# PERMANENCE: teaching language-model agents to recognise irreversible actions
+The most expensive bugs in agentic LLM deployments are not hallucinations. They are well-formed, syntactically correct, confidently executed actions against production state that cannot be undone. `rm -rf` the wrong directory. `git push --force` over a teammate's commit. `DROP TABLE` with no snapshot. The model is not confused about what these commands do — it just never learned that some commands, in some states, leave no way back.
+**PERMANENCE** is an OpenEnv environment and training recipe that treats this capability gap as the objective, not as a symptom.
+---
+## The claim
+A language model trained with PERMANENCE can, before executing an action against a filesystem, git repo, or database, produce a calibrated prediction of how reversible that action is **given the current state of the world**. "Given the current state of the world" is doing a lot of work here — and it is the central reason this is an RL problem.
+![Confusion matrix](results/confusion_matrix.png)
+*Prediction accuracy on the RL-trained policy over 34 valid held-out scenarios. Every R2 action is correctly predicted R2; every R5 action is correctly predicted R5. Zero catastrophic miscalls across the full evaluation and all 1 200 training episodes.*
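A minimal sketch of how a confusion matrix and a "catastrophic miscall" count could be tallied. It is not part of the committed file. The definition used here (an R5 action predicted as anything other than R5) is an assumption, since the post does not define the term, and the sample pairs are made up for illustration rather than taken from the evaluation.

```python
from collections import Counter


def confusion_counts(pairs):
    """Tally how often each (actual, predicted) level pair occurs."""
    return Counter(pairs)


def catastrophic_miscalls(pairs, irreversible="R5"):
    """Assumed definition: an irreversible action predicted as any other level."""
    return sum(
        1
        for actual, predicted in pairs
        if actual == irreversible and predicted != irreversible
    )


# Made-up (actual, predicted) pairs, not the held-out evaluation data.
sample_pairs = [("R2", "R2"), ("R5", "R5"), ("R2", "R2"), ("R5", "R2")]

matrix = confusion_counts(sample_pairs)
miscalls = catastrophic_miscalls(sample_pairs)
print(matrix[("R2", "R2")], miscalls)
```

The last sample pair is a deliberate miscall, so the sketch reports one; the caption above reports zero for the trained policy.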
+---
+## The environment
+The environment is built around three operational-semantics simulators:
+- `MockFS` for filesystem actions and recovery layers such as trash and backups
+- `MockGitRepo` for git actions, reflog-style recovery, and clone-based recovery
+- `MockDatabase` for database actions, snapshots, WAL, and transaction-based rollback
+The same action id can land at different reversibility levels depending on the world state. That means the model has to read context, not just memorise that one command is dangerous.
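A minimal sketch of the state-dependence described above. It is not part of the committed file. `FsState`, `classify_delete`, and the meanings attached to R2 and R5 are hypothetical: the post names those two levels but does not define the scale, and this is not the real `MockFS` interface.

```python
from dataclasses import dataclass
from enum import IntEnum


class Reversibility(IntEnum):
    # Hypothetical reading of the two levels the post mentions.
    R2 = 2  # assumed: recoverable through a recovery layer
    R5 = 5  # assumed: no recovery path left


@dataclass(frozen=True)
class FsState:
    """Toy stand-in for world state; not the real MockFS interface."""
    trash_enabled: bool
    has_backup: bool


def classify_delete(state: FsState) -> Reversibility:
    """Same action id, different level depending on which recovery layers exist."""
    if state.trash_enabled or state.has_backup:
        return Reversibility.R2
    return Reversibility.R5


with_trash = classify_delete(FsState(trash_enabled=True, has_backup=False))
bare = classify_delete(FsState(trash_enabled=False, has_backup=False))
print(with_trash.name, bare.name)
```

The point of the toy is that the label is a function of the state, not of the command string, so a policy that only memorises commands cannot get both cases right.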
+---
+## The training signal
+We train with a four-part OpenEnv rubric:
+- task completion
+- prediction calibration
+- option preservation
+- catastrophe avoidance
+That reward is the same signal used at evaluation time, so the model is trained on the behaviour it is actually judged on. The training recipe uses supervised warmup, a format gate, GRPO, and held-out evaluation.
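A minimal sketch of how four rubric components could be folded into one scalar reward. It is not part of the committed file. The weighted sum, the equal weights, and the assumption that each component is scored in [0, 1] are all hypothetical; the post lists the four components but does not say how they are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricScores:
    """One score per rubric component, each assumed to lie in [0, 1]."""
    task_completion: float
    prediction_calibration: float
    option_preservation: float
    catastrophe_avoidance: float


# Hypothetical equal weights; the real rubric may combine components differently.
WEIGHTS = {
    "task_completion": 0.25,
    "prediction_calibration": 0.25,
    "option_preservation": 0.25,
    "catastrophe_avoidance": 0.25,
}


def total_reward(scores: RubricScores, weights: dict = WEIGHTS) -> float:
    """Weighted sum of the four component scores."""
    return sum(weight * getattr(scores, name) for name, weight in weights.items())


episode = RubricScores(
    task_completion=1.0,
    prediction_calibration=0.5,
    option_preservation=1.0,
    catastrophe_avoidance=1.0,
)
reward = total_reward(episode)
print(reward)
```

Using one function for both training and evaluation mirrors the statement above that the reward is the same signal in both settings.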
+![Reward comparison](results/reward_comparison.png)
+*Scripted baseline, supervised warmup, and RL-trained policy on the same held-out seeds.*
+![Training reward curve](results/training_reward_curve.png)
+*Per-episode reward during policy optimisation, with the curriculum phasing in harder scenarios over time.*
+---
+## Why this matters
+This is a small but concrete example of an RL environment for agent safety and operational reasoning. The interesting part is not that the agent learns to say "dangerous". The interesting part is that it learns to condition that judgment on recovery state, because the same command can be reversible in one state and irreversible in another.
+---
+## Honest limits
+The evaluation distribution is intentionally narrow. The evidence is strongest on the R2 and R5 cases that the scenario generator surfaces most often, and the writeup does not claim broad generalisation across unrelated domains. That is the point of the writeup: show what was trained, what was measured, and where the evidence stops.
+---
+## Reproduce
+```bash
+python training/generate_warmup_traces.py
+python -m training.pipeline --config training/config.yaml
+python tools/render_results.py
+```
+The repo also includes a Colab quickstart in `notebooks/train_grpo_colab.ipynb`.

README.md CHANGED

@@ -20,7 +20,7 @@ tags:
 🔗 **Live environment** — https://chane35-permanence.hf.space
 🔗 **Training workspace** — https://chane35-permanence-training.hf.space
 🔗 **Artifacts** — https://huggingface.co/datasets/chane35/permanence-artifacts
-🔗 **Blog post** — [`docs/BLOG_POST.md`](docs/BLOG_POST.md)
+🔗 **Blog post** — [`Blog.md`](Blog.md)
 🔗 **Architecture deep-dive** — [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md)
 🔗 **Training methods** — [`docs/METHODS.md`](docs/METHODS.md)
 🔗 **Full results** — [`docs/RESULTS.md`](docs/RESULTS.md)