---
license: mit
tags:
- safe-multi-agent-reinforcement-learning
- constrained-optimization
- multi-agent-mujoco
- responsibility-decomposition
- mappo-lagrangian
- corl-2026
language: en
library_name: pytorch
pipeline_tag: reinforcement-learning
---
# Learning Safety Burden Allocation by Equilibrium Response
This repository contains trained model checkpoints for the paper:
> **Learning Safety Burden Allocation by Equilibrium Response**
> Xiaoyang Cao
> CoRL 2026
## Model Description
We propose a principled method for decomposing a joint safety constraint into per-agent responsibility shares in cooperative multi-agent reinforcement learning. Rather than using a uniform split or hand-tuned allocations, our approach **learns state-dependent responsibility vectors** via bilevel optimization: the outer loop adjusts each agent's share of the safety burden (rho) to maximize team welfare, and the inner loop trains policies with MAPPO-Lagrangian under the resulting per-agent constraints.
The key insight is that agents differ in how cheaply they can avoid cost, and this difference varies from state to state. By letting rho adapt to the state, agents that can cheaply reduce cost take on more of the safety burden, which frees the remaining agents to maximize reward.
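As a rough illustration of what "state-dependent shares" means, the sketch below scores each agent with a linear function of the state and normalizes the scores with a softmax. The weight matrix, the two-feature state, and the function names are hypothetical and are not taken from the released code; the actual rho network is an MLP.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def responsibility_shares(state, weights):
    """One linear score per agent (hypothetical stand-in for the MLP), then softmax."""
    logits = [sum(w * s for w, s in zip(row, state)) for row in weights]
    return softmax(logits)

# Hypothetical weights: 3 agents, 2 state features.
weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

rho_state_a = responsibility_shares([0.0, 0.0], weights)  # all scores equal
rho_state_b = responsibility_shares([2.0, 0.0], weights)  # agent 0 scores highest
```

In the first state every agent receives the same share; in the second, the share shifts toward agent 0 while the shares still sum to 1.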
### Architecture
- **Inner loop**: MAPPO-Lagrangian (Multi-Agent PPO with Lagrangian constraint handling)
  - Shared actor-critic networks (MLP, 2 layers, 256 hidden units)
  - Per-agent Lagrange multiplier for constraint enforcement
- **Outer loop**: State-dependent rho network
  - MLP mapping joint state to per-agent responsibility shares
  - Softmax output to ensure shares sum to 1
  - Updated via implicit differentiation through the inner equilibrium
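Implicit differentiation typically requires solving a linear system without forming the matrix explicitly, and a truncated conjugate-gradient (CG) solve is a common way to do that. The sketch below is a generic CG routine that only needs matrix-vector products. It is not the released implementation: the 2x2 matrix, the right-hand side, and the function names are hypothetical, and the exact system solved in the paper is not specified in this card.

```python
def conjugate_gradient(matvec, b, n_steps=10, tol=1e-10):
    """Approximately solve A x = b for symmetric positive-definite A.

    `matvec(v)` must return A @ v; A itself is never materialized.
    `tol` is compared against the squared residual norm.
    """
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x, with x = 0
    p = list(r)  # current search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(n_steps):
        if rs_old < tol:
            break
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Hypothetical small system standing in for a Hessian-vector-product oracle.
A = [[4.0, 1.0], [1.0, 3.0]]

def matvec(v):
    return [sum(a * vi for a, vi in zip(row, v)) for row in A]

x = conjugate_gradient(matvec, [1.0, 2.0], n_steps=10)
```

Capping `n_steps` trades solve accuracy for compute.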
## Environments
| Environment | Agents | Description | Training episodes |
|---|---|---|---|
| HalfCheetah 6x1 | 6 | Each agent controls 1 joint of a 6-joint cheetah | 1500 |
| Humanoid 9\|8 | 2 | Two agents control 9 and 8 joints respectively | 1000 |
| ManySegmentSwimmer 6x1 | 6 | Each agent controls 1 segment of a 6-segment swimmer | 1500 |
| Resource Harvest | 5 | Agents harvest resources with a shared sustainability constraint | 1500 |
## Methods
- **Baseline (MAPPO-Lag)**: Standard MAPPO-Lagrangian with uniform safety budget split (rho_i = 1/n)
- **PAL (Per-Agent Lambda)**: Each agent maintains its own Lagrange multiplier against the joint constraint
- **Ours (Learned rho)**: State-dependent responsibility decomposition via bilevel optimization
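All three methods rely on Lagrange multipliers that grow when an agent exceeds its cost limit. A minimal sketch of the standard projected dual-ascent update is shown below, using the uniform split of the baseline. The measured costs are made up, and the assumption that agent i's limit equals rho_i times the joint limit is an illustration, not a statement about the released code.

```python
def update_multipliers(lambdas, avg_costs, cost_limits, lr=1e-2):
    """Projected dual ascent: raise lambda_i when agent i exceeds its limit,
    lower it otherwise, and never let it drop below zero."""
    return [
        max(0.0, lam + lr * (cost - limit))
        for lam, cost, limit in zip(lambdas, avg_costs, cost_limits)
    ]

n_agents = 6
joint_limit = 1.0

# Baseline: uniform split, rho_i = 1/n (assumed to scale the joint limit).
uniform_limits = [joint_limit / n_agents] * n_agents

lambdas = [0.0] * n_agents
avg_costs = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]  # hypothetical per-agent costs
lambdas = update_multipliers(lambdas, avg_costs, uniform_limits)
```

Only agent 0 exceeds its limit here, so only its multiplier becomes positive; the others stay clipped at zero.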
## Checkpoint Structure
```
checkpoints/
  halfcheetah_6x1/          # 6-agent HalfCheetah
    baseline_s{0..4}.pt     # 5 seeds, MAPPO-Lag uniform split
    pal_s{0..4}.pt          # 5 seeds, per-agent lambda
    ours_s{0..4}.pt         # 5 seeds, learned state-dependent rho
  humanoid_9_8/             # 2-agent Humanoid (9|8 partition)
    baseline_s{0..4}.pt     # 5 seeds
    pal_s{0..3}.pt          # 4 seeds
    ours_s{0..4}.pt         # 5 seeds
  mss_6x1/                  # 6-agent ManySegmentSwimmer
    baseline_s{0..4}.pt     # 5 seeds
    pal_s{0..4}.pt          # 5 seeds
    ours_s{0..4}.pt         # 5 seeds
  harvest_5/                # 5-agent Resource Harvest
    baseline_s{0..4}.pt     # 5 seeds
    pal_s{0..4}.pt          # 5 seeds
    ours_s{0..4}.pt         # 5 seeds
  ablation_hc6x1/           # Ablations on HalfCheetah 6x1
    cg3_s{0..2}.pt          # Conjugate gradient steps = 3
    cg5_s{0..2}.pt          # Conjugate gradient steps = 5
    trho25_s{0..2}.pt       # Rho update period T_rho = 25
    trho100_s{0..2}.pt      # Rho update period T_rho = 100
```
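For scripting over the whole collection, the layout above can be expanded into explicit file paths. The helper below is a sketch that assumes the `s{0..N}` notation means seeds 0 through N inclusive; the function name is hypothetical.

```python
def checkpoint_files():
    """Expand the directory layout into a flat list of relative paths."""
    # Number of seeds per (environment, run name), copied from the tree above.
    layout = {
        "halfcheetah_6x1": {"baseline": 5, "pal": 5, "ours": 5},
        "humanoid_9_8": {"baseline": 5, "pal": 4, "ours": 5},
        "mss_6x1": {"baseline": 5, "pal": 5, "ours": 5},
        "harvest_5": {"baseline": 5, "pal": 5, "ours": 5},
        "ablation_hc6x1": {"cg3": 3, "cg5": 3, "trho25": 3, "trho100": 3},
    }
    return [
        f"checkpoints/{env}/{name}_s{seed}.pt"
        for env, runs in layout.items()
        for name, n_seeds in runs.items()
        for seed in range(n_seeds)
    ]

files = checkpoint_files()
```

Each entry can be passed as the `filename` argument of `hf_hub_download`.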
Each `.pt` file is a dictionary containing:
- `actor_state_dict`: Policy network weights
- `critic_state_dict`: Value network weights
- `rho_net_state_dict` (ours only): Responsibility decomposition network
- `lambda_value`: Final Lagrange multiplier(s)
- `config`: Training hyperparameters
- `final_metrics`: Final evaluation metrics (reward, cost, constraint satisfaction)
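A quick sanity check after loading is to confirm that the expected keys are present. The helper below is a small sketch with a hypothetical name; it assumes that only checkpoints trained with the learned-rho method carry `rho_net_state_dict`, as listed above.

```python
COMMON_KEYS = {
    "actor_state_dict",
    "critic_state_dict",
    "lambda_value",
    "config",
    "final_metrics",
}

def missing_keys(checkpoint, method="ours"):
    """Return the expected keys that are absent from a loaded checkpoint dict."""
    expected = set(COMMON_KEYS)
    if method == "ours":
        expected.add("rho_net_state_dict")
    return sorted(expected - set(checkpoint))

# Stand-in dictionary; a real one would come from torch.load(...).
example = {key: None for key in COMMON_KEYS}
baseline_missing = missing_keys(example, method="baseline")
ours_missing = missing_keys(example, method="ours")
```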
## How to Load and Evaluate
```python
import torch
from huggingface_hub import hf_hub_download

# Download a checkpoint
ckpt_path = hf_hub_download(
    repo_id="Sean13/responsibility-decomposition",
    filename="checkpoints/halfcheetah_6x1/ours_s0.pt",
)

# Load on CPU. The file also stores non-tensor entries (config, metrics); if a
# recent PyTorch release refuses to unpickle them, passing weights_only=False to
# torch.load may be needed (only do this for files you trust).
checkpoint = torch.load(ckpt_path, map_location="cpu")
print(checkpoint.keys())
# dict_keys(['actor_state_dict', 'critic_state_dict', 'rho_net_state_dict',
#            'lambda_value', 'config', 'final_metrics'])

# Inspect final metrics
print(f"Welfare: {checkpoint['final_metrics']['total_welfare']:.1f}")
print(f"Constraint satisfaction: {checkpoint['final_metrics']['constraint_satisfaction_pct']:.1f}%")
print(f"Final rho: {checkpoint['final_metrics']['final_rho']}")
```
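To compare methods, the per-seed `final_metrics` dictionaries can be aggregated into a mean and a sample standard deviation. The sketch below uses only the standard library; the function name is hypothetical and the numbers are placeholders, not results from the paper.

```python
from statistics import mean, stdev

def summarize(metrics_per_seed, key):
    """Mean and sample standard deviation of one metric across seeds."""
    values = [metrics[key] for metrics in metrics_per_seed]
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread

# Placeholder values standing in for checkpoint["final_metrics"] of three seeds.
placeholder_metrics = [
    {"total_welfare": 100.0},
    {"total_welfare": 110.0},
    {"total_welfare": 120.0},
]
avg, spread = summarize(placeholder_metrics, "total_welfare")
```

In practice, `metrics_per_seed` would be built by loading each seed's checkpoint and collecting its `final_metrics` entry.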
### Run Evaluation
```bash
# Clone this repo
git clone https://huggingface.co/Sean13/responsibility-decomposition
cd responsibility-decomposition

# Evaluate a checkpoint
python scripts/eval.py \
    --checkpoint checkpoints/halfcheetah_6x1/ours_s0.pt \
    --env mamujoco_HalfCheetah_6x1 \
    --n_eval_episodes 100

# Render a video
python scripts/render.py \
    --checkpoint checkpoints/halfcheetah_6x1/ours_s0.pt \
    --env mamujoco_HalfCheetah_6x1 \
    --output_dir videos/
```
## Training Details
### Hyperparameters (HalfCheetah 6x1)
| Parameter | Value |
|---|---|
| Algorithm | MAPPO-Lagrangian |
| Learning rate (policy) | 5e-4 |
| Learning rate (critic) | 5e-4 |
| Learning rate (rho network) | 5e-3 |
| Learning rate (lambda) | 1e-2 |
| Hidden size | 256 |
| Layers | 2 |
| PPO clip | 0.2 |
| GAE lambda | 0.95 |
| Discount | 0.99 |
| Episode length | 1000 |
| Rollout threads | 10 |
| Training episodes | 1500 |
| Cost limit (per-step) | 1.0 |
| Rho update period (T_rho) | 50 |
| CG steps (implicit diff) | 10 |
| Activation | ReLU |
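The episode count, episode length, and number of rollout threads together appear to determine the environment-step budget passed to the training script. The check below assumes that one training episode is collected in every rollout thread; under that assumption the product equals the `--num_env_steps 15000000` value used in the reproduction command.

```python
def total_env_steps(training_episodes, episode_length, rollout_threads):
    """Environment steps, assuming each episode runs once in every rollout thread."""
    return training_episodes * episode_length * rollout_threads

# Values from the HalfCheetah 6x1 hyperparameter table.
steps = total_env_steps(training_episodes=1500, episode_length=1000, rollout_threads=10)
```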
### Compute
All experiments were run on the MIT ORCD cluster using NVIDIA L40S GPUs (one GPU per run). Approximate wall-clock times:
- HalfCheetah 6x1: ~4 hours per seed
- Humanoid 9|8: ~3 hours per seed
- ManySegmentSwimmer 6x1: ~4 hours per seed
- Resource Harvest: ~2 hours per seed
Total compute: approximately 250 GPU-hours across all experiments and ablations.
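The total can be roughly cross-checked from the released checkpoint counts and the per-seed times. The sketch below assumes that each ablation run costs about as much as a regular HalfCheetah 6x1 run (the card does not state ablation timings) and ignores any runs that were not released.

```python
# (number of released runs, approximate hours per run)
runs = {
    "HalfCheetah 6x1": (15, 4),
    "Humanoid 9|8": (14, 3),
    "ManySegmentSwimmer 6x1": (15, 4),
    "Resource Harvest": (15, 2),
    "Ablations (HalfCheetah 6x1)": (12, 4),  # assumed ~4 h per run
}

gpu_hours = {name: count * hours for name, (count, hours) in runs.items()}
total_gpu_hours = sum(gpu_hours.values())
```

Under these assumptions the released runs account for about 240 GPU-hours, in line with the stated total of roughly 250.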
## How to Reproduce from Scratch
1. Install the training codebase:
```bash
git clone https://github.com/YOUR_USERNAME/corl_respon_vector
cd corl_respon_vector/macpo_base/MAPPO-Lagrangian
pip install -e .
```
2. Run training (example: HalfCheetah 6x1, ours, seed 0):
```bash
python mappo_lagrangian/scripts/train/train_mujoco.py \
    --env_name mujoco \
    --scenario HalfCheetah-v2 \
    --agent_conf 6x1 \
    --algorithm_name mappo \
    --seed 0 \
    --n_rollout_threads 10 \
    --num_env_steps 15000000 \
    --episode_length 1000 \
    --hidden_size 256 \
    --layer_N 2 \
    --lr 5e-4 \
    --critic_lr 5e-4 \
    --use_responsibility_decomposition \
    --rho_mode state_dependent \
    --lr_rho 5e-3 \
    --rho_update_period 50 \
    --cg_steps 10 \
    --cost_limit 1.0
```
See `configs/` for full per-environment configurations.
## Citation
```bibtex
@inproceedings{cao2026responsibility,
title={Learning Safety Burden Allocation by Equilibrium Response},
author={Cao, Xiaoyang},
booktitle={Conference on Robot Learning (CoRL)},
year={2026}
}
```
## License
This work is released under the [MIT License](https://opensource.org/licenses/MIT).
## Acknowledgments
This research was supported by computational resources from the MIT Office of Research Computing and Data (ORCD).