---
license: mit
tags:
  - safe-multi-agent-reinforcement-learning
  - constrained-optimization
  - multi-agent-mujoco
  - responsibility-decomposition
  - mappo-lagrangian
  - corl-2026
language: en
library_name: pytorch
pipeline_tag: reinforcement-learning
---

# Learning Safety Burden Allocation by Equilibrium Response

This repository contains trained model checkpoints for the paper:

> **Learning Safety Burden Allocation by Equilibrium Response**
> Xiaoyang Cao
> CoRL 2026

## Model Description

We propose a principled method for decomposing a joint safety constraint into per-agent responsibility shares in cooperative multi-agent reinforcement learning. Rather than using uniform splits or hand-tuned allocations, our approach **learns state-dependent responsibility vectors** via bilevel optimization. The outer loop adjusts each agent's share of the safety burden (rho) to maximize team welfare; the inner loop trains policies with MAPPO-Lagrangian under the resulting per-agent constraints.

The key insight is that agents differ in how cheaply they can avoid cost, and that this ability varies with the state. By letting rho adapt to the state, agents that can cheaply reduce cost take on more of the safety burden, freeing the remaining agents to maximize reward.

### Architecture

- **Inner loop**: MAPPO-Lagrangian (Multi-Agent PPO with Lagrangian constraint handling)
  - Shared actor-critic networks (MLP, 2 layers, 256 hidden units)
  - Per-agent Lagrange multiplier for constraint enforcement
- **Outer loop**: State-dependent rho network
  - MLP mapping joint state to per-agent responsibility shares
  - Softmax output to ensure shares sum to 1
  - Updated via implicit differentiation through the inner equilibrium
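
The softmax output described above can be illustrated with a minimal, dependency-free sketch. It assumes that each agent's cost budget is its share times the joint cost limit, which is consistent with the uniform split `rho_i = 1/n` used by the baseline. The function names and the logits are hypothetical and are not taken from the training code.

```python
import math

def responsibility_shares(logits):
    """Softmax over per-agent logits: positive shares that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def per_agent_budgets(shares, joint_cost_limit):
    """Split the joint cost limit in proportion to the shares (assumed rule)."""
    return [rho * joint_cost_limit for rho in shares]

# Hypothetical logits for a 6-agent task. In the trained model they would be
# the last-layer outputs of the rho network for the current joint state.
logits = [0.0, 0.0, 1.0, -1.0, 0.5, 0.0]
rho = responsibility_shares(logits)
budgets = per_agent_budgets(rho, joint_cost_limit=1.0)

# Equal logits recover the uniform split used by the baseline.
uniform = responsibility_shares([0.0] * 6)
```

Because the shares always sum to 1, the per-agent budgets add up to the joint limit for any logits the network produces.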

## Environments

| Environment | Agents | Description | Training episodes |
|---|---|---|---|
| HalfCheetah 6x1 | 6 | Each agent controls 1 joint of a 6-joint cheetah | 1500 |
| Humanoid 9\|8 | 2 | Two agents control 9 and 8 joints respectively | 1000 |
| ManySegmentSwimmer 6x1 | 6 | Each agent controls 1 segment of a 6-segment swimmer | 1500 |
| Resource Harvest | 5 | Agents harvest resources with a shared sustainability constraint | 1500 |

## Methods

- **Baseline (MAPPO-Lag)**: Standard MAPPO-Lagrangian with a uniform safety budget split (rho_i = 1/n)
- **PAL (Per-Agent Lambda)**: Each agent maintains its own Lagrange multiplier against the joint constraint
- **Ours (Learned rho)**: State-dependent responsibility decomposition via bilevel optimization
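
To see how the budget split changes the constraint signal each agent receives, here is a small sketch of a standard projected dual-ascent step on a Lagrange multiplier. The update rule is the textbook one and may differ in detail from the training code. All costs and the learned split are hypothetical numbers chosen for illustration, not results.

```python
def dual_ascent_step(lam, avg_cost, budget, lr=1e-2):
    """One projected gradient-ascent step on a multiplier (kept non-negative)."""
    return max(0.0, lam + lr * (avg_cost - budget))

n_agents = 6
cost_limit = 1.0
# Hypothetical average per-agent costs (they sum to the joint limit).
avg_costs = [0.30, 0.10, 0.05, 0.25, 0.20, 0.10]

# Baseline: every agent gets the same budget, cost_limit / n.
uniform = [cost_limit / n_agents] * n_agents
# Hypothetical non-uniform split, standing in for a learned rho.
learned = [0.30, 0.12, 0.08, 0.25, 0.15, 0.10]

lams_uniform = [dual_ascent_step(0.0, c, b) for c, b in zip(avg_costs, uniform)]
lams_learned = [dual_ascent_step(0.0, c, b) for c, b in zip(avg_costs, learned)]
```

In this toy case the uniform split penalizes three agents even though the joint cost exactly meets the limit, while the non-uniform split penalizes only the one agent that exceeds its own budget.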

## Checkpoint Structure

```
checkpoints/
  halfcheetah_6x1/          # 6-agent HalfCheetah
    baseline_s{0..4}.pt     # 5 seeds, MAPPO-Lag uniform split
    pal_s{0..4}.pt          # 5 seeds, per-agent lambda
    ours_s{0..4}.pt         # 5 seeds, learned state-dep rho
  humanoid_9_8/              # 2-agent Humanoid (9|8 partition)
    baseline_s{0..4}.pt     # 5 seeds
    pal_s{0..3}.pt          # 4 seeds
    ours_s{0..4}.pt         # 5 seeds
  mss_6x1/                   # 6-agent ManySegmentSwimmer
    baseline_s{0..4}.pt     # 5 seeds
    pal_s{0..4}.pt          # 5 seeds
    ours_s{0..4}.pt         # 5 seeds
  harvest_5/                  # 5-agent Resource Harvest
    baseline_s{0..4}.pt     # 5 seeds
    pal_s{0..4}.pt          # 5 seeds
    ours_s{0..4}.pt         # 5 seeds
  ablation_hc6x1/            # Ablations on HalfCheetah 6x1
    cg3_s{0..2}.pt          # Conjugate gradient steps = 3
    cg5_s{0..2}.pt          # Conjugate gradient steps = 5
    trho25_s{0..2}.pt       # Rho update period T_rho = 25
    trho100_s{0..2}.pt      # Rho update period T_rho = 100
```

Each `.pt` file is a dictionary containing:
- `actor_state_dict`: Policy network weights
- `critic_state_dict`: Value network weights
- `rho_net_state_dict` (ours only): Responsibility decomposition network
- `lambda_value`: Final Lagrange multiplier(s)
- `config`: Training hyperparameters
- `final_metrics`: Final evaluation metrics (reward, cost, constraint satisfaction)
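
The naming scheme and key list above can be captured in two small helpers, sketched here with hypothetical function names. They only build the repository-relative path and report which documented keys are absent from a loaded dictionary; they do not load anything themselves.

```python
# Keys documented for every checkpoint; `rho_net_state_dict` is added for "ours".
COMMON_KEYS = {
    "actor_state_dict",
    "critic_state_dict",
    "lambda_value",
    "config",
    "final_metrics",
}

def checkpoint_filename(env_dir, method, seed):
    """Build a path such as checkpoints/halfcheetah_6x1/ours_s0.pt."""
    return f"checkpoints/{env_dir}/{method}_s{seed}.pt"

def missing_keys(checkpoint, method):
    """Return the documented keys that are absent from a loaded checkpoint."""
    expected = set(COMMON_KEYS)
    if method == "ours":
        expected.add("rho_net_state_dict")
    return sorted(expected - set(checkpoint))

path = checkpoint_filename("halfcheetah_6x1", "ours", 0)
```

Passing the result of `torch.load` to `missing_keys` is a quick sanity check before using a checkpoint in evaluation code.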

## How to Load and Evaluate

```python
import torch
from huggingface_hub import hf_hub_download

# Download a checkpoint
ckpt_path = hf_hub_download(
    repo_id="Sean13/responsibility-decomposition",
    filename="checkpoints/halfcheetah_6x1/ours_s0.pt",
)

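# Load. On PyTorch >= 2.6, torch.load defaults to weights_only=True; if loading
# fails on the non-tensor entries (config, final_metrics), pass weights_only=False,
# but only for checkpoints you trust.
checkpoint = torch.load(ckpt_path, map_location="cpu")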
print(checkpoint.keys())
# dict_keys(['actor_state_dict', 'critic_state_dict', 'rho_net_state_dict',
#            'lambda_value', 'config', 'final_metrics'])

# Inspect final metrics
print(f"Welfare: {checkpoint['final_metrics']['total_welfare']:.1f}")
print(f"Constraint satisfaction: {checkpoint['final_metrics']['constraint_satisfaction_pct']:.1f}%")
print(f"Final rho: {checkpoint['final_metrics']['final_rho']}")
```
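
To compare methods across seeds, the per-checkpoint metrics can be aggregated. The sketch below assumes each loaded checkpoint exposes `final_metrics` as shown above. `summarize_metric` is a hypothetical helper, and the dictionaries are placeholders rather than real results.

```python
from statistics import mean, stdev

def summarize_metric(checkpoints, key):
    """Mean and sample standard deviation of one final metric across seeds."""
    values = [float(ckpt["final_metrics"][key]) for ckpt in checkpoints]
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread

# Placeholder checkpoints; in practice, pass the dicts returned by torch.load
# for ours_s0.pt, ours_s1.pt, and so on.
fake_checkpoints = [
    {"final_metrics": {"total_welfare": w}} for w in (100.0, 110.0, 90.0)
]
avg, spread = summarize_metric(fake_checkpoints, "total_welfare")
print(f"Welfare: {avg:.1f} +/- {spread:.1f}")
```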

### Run Evaluation

```bash
# Clone this repo
git clone https://huggingface.co/Sean13/responsibility-decomposition
cd responsibility-decomposition

# Evaluate a checkpoint
python scripts/eval.py \
    --checkpoint checkpoints/halfcheetah_6x1/ours_s0.pt \
    --env mamujoco_HalfCheetah_6x1 \
    --n_eval_episodes 100

# Render a video
python scripts/render.py \
    --checkpoint checkpoints/halfcheetah_6x1/ours_s0.pt \
    --env mamujoco_HalfCheetah_6x1 \
    --output_dir videos/
```

## Training Details

### Hyperparameters (HalfCheetah 6x1)

| Parameter | Value |
|---|---|
| Algorithm | MAPPO-Lagrangian |
| Learning rate (policy) | 5e-4 |
| Learning rate (critic) | 5e-4 |
| Learning rate (rho network) | 5e-3 |
| Learning rate (lambda) | 1e-2 |
| Hidden size | 256 |
| Layers | 2 |
| PPO clip | 0.2 |
| GAE lambda | 0.95 |
| Discount | 0.99 |
| Episode length | 1000 |
| Rollout threads | 10 |
| Training episodes | 1500 |
| Cost limit (per-step) | 1.0 |
| Rho update period (T_rho) | 50 |
| CG steps (implicit diff) | 10 |
| Activation | ReLU |
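
The "CG steps" entry refers to conjugate gradient, which implicit differentiation commonly uses to solve a linear system with a fixed iteration budget and only matrix-vector products. The following is a generic textbook sketch, not the repository's implementation. The explicit 2x2 matrix is an assumption standing in for the Hessian-vector product that a real implementation would obtain from automatic differentiation.

```python
def conjugate_gradient(matvec, b, n_steps=10, tol=1e-10):
    """Approximately solve A x = b for symmetric positive-definite A."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x, with x starting at zero
    p = list(r)  # first search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(n_steps):
        if rs_old < tol:  # residual is already negligible
            break
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Toy symmetric positive-definite matrix (hypothetical stand-in for a Hessian).
A = [[4.0, 1.0], [1.0, 3.0]]

def matvec(v):
    return [sum(a * vi for a, vi in zip(row, v)) for row in A]

x = conjugate_gradient(matvec, [1.0, 2.0], n_steps=10)
```

Fewer steps make each rho update cheaper but solve the system less accurately, which is the trade-off that the `cg3` and `cg5` ablation checkpoints vary.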

### Compute

All experiments were run on the MIT ORCD cluster using NVIDIA L40S GPUs (one GPU per run). Approximate wall-clock times:
- HalfCheetah 6x1: ~4 hours per seed
- Humanoid 9|8: ~3 hours per seed
- ManySegmentSwimmer 6x1: ~4 hours per seed
- Resource Harvest: ~2 hours per seed

Total compute: approximately 250 GPU-hours across all experiments and ablations.

## How to Reproduce from Scratch

1. Install the training codebase:
```bash
git clone https://github.com/YOUR_USERNAME/corl_respon_vector
cd corl_respon_vector/macpo_base/MAPPO-Lagrangian
pip install -e .
```

2. Run training (example: HalfCheetah 6x1, ours, seed 0):
```bash
python mappo_lagrangian/scripts/train/train_mujoco.py \
    --env_name mujoco \
    --scenario HalfCheetah-v2 \
    --agent_conf 6x1 \
    --algorithm_name mappo \
    --seed 0 \
    --n_rollout_threads 10 \
    --num_env_steps 15000000 \
    --episode_length 1000 \
    --hidden_size 256 \
    --layer_N 2 \
    --lr 5e-4 \
    --critic_lr 5e-4 \
    --use_responsibility_decomposition \
    --rho_mode state_dependent \
    --lr_rho 5e-3 \
    --rho_update_period 50 \
    --cg_steps 10 \
    --cost_limit 1.0
```

See `configs/` for full per-environment configurations.
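
The `--num_env_steps` value in the command above matches the hyperparameter table if each of the 1500 training episodes is collected in all 10 rollout threads. That reading is an assumption, and `total_env_steps` is a hypothetical helper for redoing the arithmetic with other settings.

```python
def total_env_steps(training_episodes, episode_length, n_rollout_threads):
    """Environment steps consumed if every episode runs in every thread."""
    return training_episodes * episode_length * n_rollout_threads

# HalfCheetah 6x1 values from the hyperparameter table.
steps = total_env_steps(
    training_episodes=1500, episode_length=1000, n_rollout_threads=10
)
print(steps)  # 15000000
```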

## Citation

```bibtex
@inproceedings{cao2026responsibility,
  title={Learning Safety Burden Allocation by Equilibrium Response},
  author={Cao, Xiaoyang},
  booktitle={Conference on Robot Learning (CoRL)},
  year={2026}
}
```

## License

This work is released under the [MIT License](https://opensource.org/licenses/MIT).

## Acknowledgments

This research was supported by computational resources from the MIT Office of Research Computing and Data (ORCD).