---
license: mit
tags:
- safe-multi-agent-reinforcement-learning
- constrained-optimization
- multi-agent-mujoco
- responsibility-decomposition
- mappo-lagrangian
- corl-2026
language: en
library_name: pytorch
pipeline_tag: reinforcement-learning
---

# Learning Safety Burden Allocation by Equilibrium Response

This repository contains trained model checkpoints for the paper:

> **Learning Safety Burden Allocation by Equilibrium Response**
> Xiaoyang Cao
> CoRL 2026

## Model Description

We propose a principled method for decomposing a joint safety constraint into per-agent responsibility shares in cooperative multi-agent reinforcement learning. Rather than using uniform splits or hand-tuned allocations, our approach **learns state-dependent responsibility vectors** via bilevel optimization: the outer loop adjusts each agent's safety burden share (rho) to maximize team welfare, while the inner loop trains policies under MAPPO-Lagrangian with the assigned per-agent constraints.

The key insight is that different agents have different capacities to avoid costs at different states. By letting rho adapt to the state, agents that can cheaply reduce cost take on more of the safety burden, freeing the other agents to maximize reward.

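To make the role of rho concrete, the following is a minimal pure-Python sketch of how a responsibility vector could turn one joint cost limit into per-agent budgets, followed by a standard projected dual-ascent step on per-agent Lagrange multipliers. It is an illustration under stated assumptions, not the training code: the helper names (`softmax`, `per_agent_budgets`, `dual_ascent_step`), the three-agent logits, and the cost values are all hypothetical, and the budget rule `d_i = rho_i * d` is inferred from the description of rho as a share of the safety budget.

```python
import math


def softmax(logits):
    """Map raw per-agent scores to positive shares that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def per_agent_budgets(rho, cost_limit):
    """Split the joint cost limit d into budgets d_i = rho_i * d (assumed rule)."""
    return [r * cost_limit for r in rho]


def dual_ascent_step(lambdas, agent_costs, budgets, lr_lambda):
    """One projected dual-ascent step per agent.

    lambda_i rises when agent i exceeds its budget, falls otherwise,
    and is clipped at zero so it stays a valid multiplier.
    """
    return [
        max(0.0, lam + lr_lambda * (cost - budget))
        for lam, cost, budget in zip(lambdas, agent_costs, budgets)
    ]


# Hypothetical outputs of a rho network for one state of a 3-agent team.
logits = [2.0, 0.0, 0.0]
rho = softmax(logits)
budgets = per_agent_budgets(rho, cost_limit=1.0)

# Made-up per-agent costs, used only to exercise the update.
costs = [0.5, 0.3, 0.1]
lambdas = dual_ascent_step([0.0, 0.0, 0.0], costs, budgets, lr_lambda=1e-2)

print("rho:    ", [round(r, 3) for r in rho])
print("budgets:", [round(b, 3) for b in budgets])
print("lambdas:", lambdas)
```

In this toy state the first agent receives the largest share and therefore the loosest budget. Only the second agent exceeds its budget, so only its multiplier becomes positive. A uniform split corresponds to passing `[1 / n] * n` as `rho`.
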
### Architecture

- **Inner loop**: MAPPO-Lagrangian (Multi-Agent PPO with Lagrangian constraint handling)
  - Shared actor-critic networks (MLP, 2 layers, 256 hidden units)
  - Per-agent Lagrange multiplier for constraint enforcement
- **Outer loop**: State-dependent rho network
  - MLP mapping joint state to per-agent responsibility shares
  - Softmax output to ensure shares sum to 1
  - Updated via implicit differentiation through the inner equilibrium

## Environments

| Environment | Agents | Description | Episodes |
|---|---|---|---|
| HalfCheetah 6x1 | 6 | Each agent controls 1 joint of a 6-joint cheetah | 1500 |
| Humanoid 9\|8 | 2 | Two agents control 9 and 8 joints respectively | 1000 |
| ManySegmentSwimmer 6x1 | 6 | Each agent controls 1 segment of a 6-segment swimmer | 1500 |
| Resource Harvest | 5 | Agents harvest resources with a shared sustainability constraint | 1500 |

## Methods

- **Baseline (MAPPO-Lag)**: Standard MAPPO-Lagrangian with uniform safety budget split (rho_i = 1/n)
- **PAL (Per-Agent Lambda)**: Each agent maintains its own Lagrange multiplier against the joint constraint
- **Ours (Learned rho)**: State-dependent responsibility decomposition via bilevel optimization

## Checkpoint Structure

```
checkpoints/
  halfcheetah_6x1/           # 6-agent HalfCheetah
    baseline_s{0..4}.pt      # 5 seeds, MAPPO-Lag uniform split
    pal_s{0..4}.pt           # 5 seeds, per-agent lambda
    ours_s{0..4}.pt          # 5 seeds, learned state-dep rho
  humanoid_9_8/              # 2-agent Humanoid (9|8 partition)
    baseline_s{0..4}.pt      # 5 seeds
    pal_s{0..3}.pt           # 4 seeds
    ours_s{0..4}.pt          # 5 seeds
  mss_6x1/                   # 6-agent ManySegmentSwimmer
    baseline_s{0..4}.pt      # 5 seeds
    pal_s{0..4}.pt           # 5 seeds
    ours_s{0..4}.pt          # 5 seeds
  harvest_5/                 # 5-agent Resource Harvest
    baseline_s{0..4}.pt      # 5 seeds
    pal_s{0..4}.pt           # 5 seeds
    ours_s{0..4}.pt          # 5 seeds
  ablation_hc6x1/            # Ablations on HalfCheetah 6x1
    cg3_s{0..2}.pt           # Conjugate gradient steps = 3
    cg5_s{0..2}.pt           # Conjugate gradient steps = 5
    trho25_s{0..2}.pt        # Rho update period T_rho = 25
    trho100_s{0..2}.pt       # Rho update period T_rho = 100
```

Each `.pt` file is a dictionary containing:

- `actor_state_dict`: Policy network weights
- `critic_state_dict`: Value network weights
- `rho_net_state_dict` (ours only): Responsibility decomposition network
- `lambda_value`: Final Lagrange multiplier(s)
- `config`: Training hyperparameters
- `final_metrics`: Final evaluation metrics (reward, cost, constraint satisfaction)

## How to Load and Evaluate

```python
import torch
from huggingface_hub import hf_hub_download

# Download a checkpoint
ckpt_path = hf_hub_download(
    repo_id="Sean13/responsibility-decomposition",
    filename="checkpoints/halfcheetah_6x1/ours_s0.pt",
)

# Load
checkpoint = torch.load(ckpt_path, map_location="cpu")
print(checkpoint.keys())
# dict_keys(['actor_state_dict', 'critic_state_dict', 'rho_net_state_dict',
#            'lambda_value', 'config', 'final_metrics'])

# Inspect final metrics
print(f"Welfare: {checkpoint['final_metrics']['total_welfare']:.1f}")
print(f"Constraint satisfaction: {checkpoint['final_metrics']['constraint_satisfaction_pct']:.1f}%")
print(f"Final rho: {checkpoint['final_metrics']['final_rho']}")
```

### Run Evaluation

```bash
# Clone this repo
git clone https://huggingface.co/Sean13/responsibility-decomposition
cd responsibility-decomposition

# Evaluate a checkpoint
python scripts/eval.py \
    --checkpoint checkpoints/halfcheetah_6x1/ours_s0.pt \
    --env mamujoco_HalfCheetah_6x1 \
    --n_eval_episodes 100

# Render a video
python scripts/render.py \
    --checkpoint checkpoints/halfcheetah_6x1/ours_s0.pt \
    --env mamujoco_HalfCheetah_6x1 \
    --output_dir videos/
```

## Training Details

### Hyperparameters (HalfCheetah 6x1)

| Parameter | Value |
|---|---|
| Algorithm | MAPPO-Lagrangian |
| Learning rate (policy) | 5e-4 |
| Learning rate (critic) | 5e-4 |
| Learning rate (rho network) | 5e-3 |
| Learning rate (lambda) | 1e-2 |
| Hidden size | 256 |
| Layers | 2 |
| PPO clip | 0.2 |
| GAE lambda | 0.95 |
| Discount | 0.99 |
| Episode length | 1000 |
| Rollout threads | 10 |
| Training episodes | 1500 |
| Cost limit (per-step) | 1.0 |
| Rho update period (T_rho) | 50 |
| CG steps (implicit diff) | 10 |
| Activation | ReLU |

### Compute

All experiments were run on the MIT ORCD cluster using NVIDIA L40S GPUs (a single GPU per run). Approximate wall-clock times:

- HalfCheetah 6x1: ~4 hours per seed
- Humanoid 9|8: ~3 hours per seed
- ManySegmentSwimmer 6x1: ~4 hours per seed
- Resource Harvest: ~2 hours per seed

Total compute: approximately 250 GPU-hours across all experiments and ablations.

## How to Reproduce from Scratch

1. Install the training codebase:

   ```bash
   git clone https://github.com/YOUR_USERNAME/corl_respon_vector
   cd corl_respon_vector/macpo_base/MAPPO-Lagrangian
   pip install -e .
   ```

2. Run training (example: HalfCheetah 6x1, ours, seed 0):

   ```bash
   python mappo_lagrangian/scripts/train/train_mujoco.py \
       --env_name mujoco \
       --scenario HalfCheetah-v2 \
       --agent_conf 6x1 \
       --algorithm_name mappo \
       --seed 0 \
       --n_rollout_threads 10 \
       --num_env_steps 15000000 \
       --episode_length 1000 \
       --hidden_size 256 \
       --layer_N 2 \
       --lr 5e-4 \
       --critic_lr 5e-4 \
       --use_responsibility_decomposition \
       --rho_mode state_dependent \
       --lr_rho 5e-3 \
       --rho_update_period 50 \
       --cg_steps 10 \
       --cost_limit 1.0
   ```

See `configs/` for full per-environment configurations.

## Citation

```bibtex
@inproceedings{cao2026responsibility,
  title={Learning Safety Burden Allocation by Equilibrium Response},
  author={Cao, Xiaoyang},
  booktitle={Conference on Robot Learning (CoRL)},
  year={2026}
}
```

## License

This work is released under the [MIT License](https://opensource.org/licenses/MIT).

## Acknowledgments

This research was supported by computational resources from the MIT Office of Research Computing and Data (ORCD).