chane35 committed
Commit e5ed352 · verified · 1 Parent(s): f24e2c7

PERMANENCE training: 4-stage SFT -> gate -> GRPO -> eval pipeline

Files changed (2)
  1. Blog.md +80 -0
  2. README.md +1 -1
Blog.md ADDED
@@ -0,0 +1,80 @@
+ ---
+ title: "PERMANENCE: teaching language-model agents to recognise irreversible actions"
+ thumbnail: results/confusion_matrix.png
+ authors:
+ - user: chane35
+ tags: [openenv, rl, world-modeling, agent-safety]
+ ---
+
+ # PERMANENCE: teaching language-model agents to recognise irreversible actions
+
+ The most expensive bugs in agentic LLM deployments are not hallucinations. They are well-formed, syntactically correct, confidently executed actions against production state that cannot be undone. `rm -rf` the wrong directory. `git push --force` over a teammate's commit. `DROP TABLE` with no snapshot. The model is not confused about what these commands do; it just never learned that some commands, in some states, leave no way back.
+
+ **PERMANENCE** is an OpenEnv environment and training recipe that treats this capability gap as the objective, not as a symptom.
+
+ ---
+
+ ## The claim
+
+ A language model trained with PERMANENCE can, before executing an action against a filesystem, git repo, or database, produce a calibrated prediction of how reversible that action is **given the current state of the world**. "Given the current state of the world" is doing a lot of work here, and it is the central reason this is an RL problem.
+
+ ![Confusion matrix](results/confusion_matrix.png)
+
+ *Prediction accuracy of the RL-trained policy over 34 valid held-out scenarios. Every R2 action is correctly predicted R2; every R5 action is correctly predicted R5. Zero catastrophic miscalls across the full evaluation and all 1,200 training episodes.*
+
+ ---
+
+ ## The environment
+
+ The environment is built around three operational-semantics simulators:
+
+ - `MockFS` for filesystem actions and recovery layers such as trash and backups
+ - `MockGitRepo` for git actions, reflog-style recovery, and clone-based recovery
+ - `MockDatabase` for database actions, snapshots, WAL, and transaction-based rollback
+
+ The same action id can land at different reversibility levels depending on the world state. That means the model has to read context, not just memorise that one command is dangerous.
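To make the state dependence concrete, here is a minimal sketch of one action rated against two world states. It is not the repo's `MockFS` API: `Level`, `FsState`, and `delete_reversibility` are hypothetical names, and the assumption that a higher level number means "harder to undo" is ours, since the post only mentions R2 and R5.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    # Assumed scale: only R2 and R5 appear in the post; treating the
    # higher number as "harder to undo" is an assumption of this sketch.
    R2 = 2
    R5 = 5


@dataclass(frozen=True)
class FsState:
    # Hypothetical recovery layers, loosely modelled on "trash and backups".
    trash_enabled: bool
    has_backup: bool


def delete_reversibility(state: FsState) -> Level:
    """Rate the same 'delete file' action against a given world state."""
    if state.trash_enabled or state.has_backup:
        return Level.R2  # some recovery layer can still restore the file
    return Level.R5  # no recovery layer is left to restore from


safe = FsState(trash_enabled=True, has_backup=False)
bare = FsState(trash_enabled=False, has_backup=False)
print(delete_reversibility(safe).name, delete_reversibility(bare).name)
```

The action is identical in both calls; only the recovery state differs, which is why a lookup table keyed on the command alone cannot produce the right label.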
+
+ ---
+
+ ## The training signal
+
+ We train with a four-part OpenEnv rubric:
+
+ - task completion
+ - prediction calibration
+ - option preservation
+ - catastrophe avoidance
+
+ That reward is the same signal used at evaluation time, so the model is trained on the behaviour it is actually judged on. The training recipe has four stages: supervised warmup (SFT), a format gate, GRPO, and held-out evaluation.
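One way to picture a four-part rubric is as a weighted average of per-component scores. The sketch below is illustrative only: the post does not state how the components are combined, so the equal weights, the [0, 1] score range, and the names `RubricScores`, `WEIGHTS`, and `rubric_reward` are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricScores:
    # Each component is assumed to lie in [0, 1]; the post gives no ranges.
    task_completion: float
    prediction_calibration: float
    option_preservation: float
    catastrophe_avoidance: float


# Hypothetical equal weights; the actual weighting is not stated in the post.
WEIGHTS = {
    "task_completion": 1.0,
    "prediction_calibration": 1.0,
    "option_preservation": 1.0,
    "catastrophe_avoidance": 1.0,
}


def rubric_reward(scores: RubricScores, weights: dict = WEIGHTS) -> float:
    """Weighted average of the four rubric components."""
    total_weight = sum(weights.values())
    weighted = sum(w * getattr(scores, name) for name, w in weights.items())
    return weighted / total_weight


# An episode that finishes the task but triggers a catastrophe.
example = RubricScores(1.0, 0.5, 1.0, 0.0)
print(rubric_reward(example))
```

Because training and evaluation share the signal, whatever the real combination is, the same function would score both a GRPO rollout and a held-out episode.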
+
+ ![Reward comparison](results/reward_comparison.png)
+
+ *Scripted baseline, supervised warmup, and RL-trained policy on the same held-out seeds.*
+
+ ![Training reward curve](results/training_reward_curve.png)
+
+ *Per-episode reward during policy optimisation, with the curriculum phasing in harder scenarios over time.*
+
+ ---
+
+ ## Why this matters
+
+ This is a small but concrete example of an RL environment for agent safety and operational reasoning. The interesting part is not that the agent learns to say "dangerous". The interesting part is that it learns to condition that judgment on recovery state, because the same command can be reversible in one state and irreversible in another.
+
+ ---
+
+ ## Honest limits
+
+ The evaluation distribution is intentionally narrow. It is strongest on the R2 and R5 cases that the scenario generator surfaces most often, and it does not claim broad generalisation across unrelated domains. That is the point of the writeup: show what was trained, what was measured, and where the evidence stops.
+
+ ---
+
+ ## Reproduce
+
+ ```bash
+ python training/generate_warmup_traces.py
+ python -m training.pipeline --config training/config.yaml
+ python tools/render_results.py
+ ```
+
+ The repo also includes a Colab quickstart in `notebooks/train_grpo_colab.ipynb`.
README.md CHANGED
@@ -20,7 +20,7 @@ tags:
  🔗 **Live environment** — https://chane35-permanence.hf.space
  🔗 **Training workspace** — https://chane35-permanence-training.hf.space
  🔗 **Artifacts** — https://huggingface.co/datasets/chane35/permanence-artifacts
- 🔗 **Blog post** — [`docs/BLOG_POST.md`](docs/BLOG_POST.md)
+ 🔗 **Blog post** — [`Blog.md`](Blog.md)
  🔗 **Architecture deep-dive** — [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md)
  🔗 **Training methods** — [`docs/METHODS.md`](docs/METHODS.md)
  🔗 **Full results** — [`docs/RESULTS.md`](docs/RESULTS.md)