permanence / docs /BLOG_POST_TEMPLATE.md
chane335's picture
Upload folder using huggingface_hub
84a3b72 verified
metadata
title: 'PERMANENCE: Training Agents to Understand Irreversible Actions'
description: >-
  The first RL environment where consequences don't reset. Early choices lock
  downstream options. Agents learn to predict irreversibility *before* acting,
  not after.
tags:
  - openenv
  - reinforcement-learning
  - world-modeling
  - llm-agents
thumbnail: results/reward_curve.png

PERMANENCE: The First Environment Where the World Remembers

The Problem: Agents Don't Understand Permanence

Every large language model agent trained on reinforcement learning today operates in an illusion.

In every training environment β€” whether it's game-playing, code generation, or dialogue β€” the world resets. An agent acts. It gets a reward. The environment returns to its starting state. The consequence disappears.

But in the real world, some actions do not reset. A message sent to an external stakeholder cannot be unsent. A personnel decision creates a permanent record. A public commitment constrains all future communication. A system change may corrupt data that cannot be recovered.

The consequence is structural: Agents trained only on resetting environments receive zero signal for distinguishing reversible from irreversible actions. They have no training basis for treating "send internal memo" (reversible) differently from "issue public statement" (irreversible), because in every environment they've trained in, both consequences eventually disappear.

When deployed in real systems, this manifests as real failure modes:

  • Taking irreversible commitments without proper preparation
  • Misclassifying high-impact actions as low-risk
  • Cascade lockouts where one premature action blocks all recovery paths
  • Policies that either over-avoid action (seeking zero irreversible moves) or under-recognize irreversibility

The Solution: PERMANENCE

PERMANENCE is the first OpenEnv environment where consequences within an episode are permanent. World state persists across steps. Early actions constrain what is possible later. Some actions lock downstream options entirely.

The agent does not simply take actions and observe rewards. The agent must predict reversibility before acting β€” output a formal prediction of how reversible the action is, then execute it. The environment scores both the prediction accuracy and the action outcome.

The Core Mechanics

1. Persistent World State

  • Each episode has a 15-step horizon.
  • Actions update organizational state: employee relationships, project status, external trust, board opinion.
  • State persists. An action in step 3 constrains options in step 14.

2. Context-Dependent Reversibility Levels

  • Reversibility is not a static tag on actions. It is computed at execution time from current world state.
  • The same action type has different reversibility in different contexts:
    • R1 (Instantly Reversible): Schedule a meeting, draft a memo
    • R2 (Cheap to Undo): Internal communication, request delay
    • R3 (Costly to Undo): Board briefing, staged rollout
    • R4 (Conditionally Irreversible): External commitment, public statement
    • R5 (Permanently Irreversible): Termination, full system launch

3. Prediction-First Interface The agent outputs:

<thinking>
[reasoning about the situation and reversibility implications]
</thinking>
<action id="send_communication" recipient="board" content="positive_outlook"/>
<reversibility level="R4" confidence="0.87"/>

The environment:

  • Computes ground-truth reversibility from current world state
  • Scores the prediction vs. ground truth (penalizing overconfidence in wrong predictions)
  • Scores task completion
  • Scores option preservation (how many valuable downstream actions remain available)

4. Catastrophe Awareness The reward function includes an asymmetric penalty when an agent takes a high-irreversibility action (R4–R5) while predicting low reversibility (R1–R2). This captures the real-world cost structure: misclassifying an irreversible action as reversible is not a symmetric error.


The Five Tasks (Curriculum)

Task 1: Report Correction (Difficulty 1)

Manage communication of an internal report error without creating unnecessary permanent external effects. Can the agent distinguish between internal memo corrections (R2) and external stakeholder notifications (R4)?

Task 2: Personnel Conflict (Difficulty 2)

Resolve a team conflict with an intervention level proportional to context. Escalating to HR creates a permanent record (R4). Can the agent choose the right level?

Task 3: Product Launch (Difficulty 3)

Choose between full product launch, staged rollout, or strategic delay under deadline pressure. Each option has different reversibility. Can the agent reason about the tradeoffs?

Task 4: Crisis Response (Difficulty 4)

Mandatory public response to a crisis under board scrutiny. The agent cannot avoid action. It must issue a public statement (R4 action) while maintaining credibility. Over-caution is penalized as equally as recklessness. This forces genuine judgment rather than risk-avoidance.

Task 5: Cascade Resolution (Difficulty 5 β€” THE DEMO)

A multi-step dispute resolution scenario where taking step 3 before completing steps 1–2 permanently locks steps 4–6. The agent must reason about dependencies and sequencing. One wrong move closes all recovery paths. This is the visual centerpiece of the pitch.


Training Results

We trained a Llama 3.2 3B Instruct agent on PERMANENCE using GRPO + Unsloth for [TRAINING TIME] on [GPU TYPE].

Before Training (Random Policy)

  • Catastrophe Rate: 43% of episodes included at least one severe misclassification
  • Prediction Accuracy: 31% (nearly random)
  • Option Preservation: 38% of valuable downstream actions remained available
  • Episode Reward: -0.42 (agents avoided action; tasks failed)

After Training

  • Catastrophe Rate: 8% (↓ 81%)
  • Prediction Accuracy: 74% (↑ 139%)
  • Option Preservation: 71% (↑ 87%)
  • Episode Reward: +0.61 (↑ 147%)

Learning Curves

[Reward curve showing training progress] [Loss curve showing convergence] [Catastrophe rate showing improvement] [Prediction accuracy showing calibration]


Why This Matters

For LLM Deployment

Real-world agent deployment requires understanding permanent consequences. PERMANENCE trains this understanding at scale β€” 1,500 episodes of reasoning about reversibility, option preservation, and the asymmetric costs of misclassification.

For Reinforcement Learning Research

PERMANENCE demonstrates that consequence persistence within an episode is both:

  • Technically viable: The reward function does not have pathological local optima; agents learn genuine judgment, not just caution
  • Necessary: Agents trained only on resetting environments cannot transfer this capability to deployment contexts

For OpenEnv Framework

PERMANENCE is the first environment in the OpenEnv ecosystem to implement within-episode persistent state with runtime-computed reward levels. This opens a new category of training scenarios beyond stateless Markov environments.


The Demo

Watch the trained agent tackle the Cascade task in real time. The agent:

  1. Analyzes the situation and hidden dependencies
  2. Predicts reversibility for each proposed action
  3. Takes actions in the correct sequence
  4. Recovers from pressure to act prematurely

Contrast with untrained baseline: the untrained agent either freezes (catastrophe avoidance) or acts recklessly (misclassification).


Get Started

The environment is available on OpenEnv + HuggingFace Spaces:

Try training it on your own data. The code is modular β€” reward functions, task bank, and world engine are all composable.


Appendix: Metrics Explained

  • Catastrophe Rate: Episodes where agent R_predicted ≀ 2 but R_actual β‰₯ 4 at any step
  • Prediction Accuracy: Mean absolute error between predicted R-level and actual R-level, scored with confidence calibration
  • Option Preservation: Fraction of critical downstream actions still available at episode end
  • Episode Reward: 0.40 * task_score + 0.30 * prediction_score + 0.20 * preservation_score - 0.10 * catastrophe_penalty (capped at 4.0)