Abstract
Multi-agent large language model systems face training instability in reinforcement learning due to global normalization mismatches; Dr. MAS addresses this with agent-specific advantage normalization, yielding markedly more stable training.
Multi-agent LLM systems enable advanced reasoning and tool use via role specialization, yet reliable reinforcement learning (RL) post-training for such systems remains difficult. In this work, we theoretically pinpoint a key reason for training instability when extending group-based RL to multi-agent LLM systems. We show that under GRPO-style optimization, a global normalization baseline may deviate from diverse agents' reward distributions, which ultimately leads to gradient-norm instability. Based on this finding, we propose Dr. MAS, a simple and stable RL training recipe for multi-agent LLM systems. Dr. MAS uses an agent-wise remedy: normalizing advantages per agent using each agent's own reward statistics, which calibrates gradient scales and dramatically stabilizes training, both theoretically and empirically. Beyond the algorithm, Dr. MAS provides an end-to-end RL training framework for multi-agent LLM systems, supporting scalable orchestration, flexible per-agent LLM serving and optimization configs, and shared resource scheduling of LLM actor backends. We evaluate Dr. MAS on multi-agent math reasoning and multi-turn search benchmarks using Qwen2.5 and Qwen3 series models. Dr. MAS achieves clear gains over vanilla GRPO (e.g., +5.6% avg@16 and +4.6% pass@16 on math, and +15.2% avg@16 and +13.1% pass@16 on search) while largely eliminating gradient spikes. Moreover, it remains highly effective under heterogeneous agent-model assignments while improving efficiency.
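The agent-wise remedy is easy to express in code. Below is a minimal sketch, assuming a GRPO-style rollout group whose trajectories are tagged with the agent that produced them; the function names and toy numbers are illustrative and not taken from the paper's released implementation.

```python
# Illustrative sketch of agent-wise advantage normalization; names are ours,
# not from the Dr. MAS codebase.
import numpy as np

def global_advantages(rewards):
    """GRPO-style baseline: normalize all rewards with pooled group statistics."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

def per_agent_advantages(rewards, agent_ids):
    """Agent-wise remedy: normalize each agent's rewards with that agent's own
    statistics, so advantage (and hence gradient) scales are calibrated per agent."""
    r = np.asarray(rewards, dtype=np.float64)
    agent_ids = np.asarray(agent_ids)
    adv = np.empty_like(r)
    for a in np.unique(agent_ids):
        mask = agent_ids == a
        adv[mask] = (r[mask] - r[mask].mean()) / (r[mask].std() + 1e-8)
    return adv

# Toy rollout group: agent 0 (e.g., a solver) has rewards on a very different
# scale than agent 1 (e.g., a verifier). Global normalization mixes the scales;
# per-agent normalization keeps each agent's advantages well-conditioned.
rewards   = [0.9, 0.1, 0.8, 0.0, 10.0, -10.0, 8.0, -8.0]
agent_ids = [0,   0,   0,   0,   1,     1,    1,    1]
print(global_advantages(rewards))
print(per_agent_advantages(rewards, agent_ids))
```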
Community
Dr. MAS is designed for stable end-to-end RL post-training 🔥 of multi-agent LLM systems. It enables agents to collaborate on complex reasoning tasks with:
✨ Flexible agent registry & multi-agent orchestration
✨ Heterogeneous LLMs (shared/non-shared)
✨ Co-training of multiple agents
✨ Efficient resource pooling
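As a rough illustration of what per-agent configuration and backend sharing could look like, here is a hypothetical sketch; the `AgentSpec` class, the role names, and the model choices below are our own placeholders, not the framework's actual API.

```python
# Purely illustrative: a possible shape for a per-agent registry with
# heterogeneous backbones and shared actor backends. Not Dr. MAS's real API.
from dataclasses import dataclass

@dataclass
class AgentSpec:
    name: str        # role in the multi-agent system
    model: str       # backbone LLM served for this agent
    trainable: bool  # whether this agent's policy is optimized
    lr: float        # per-agent optimization setting

agents = [
    AgentSpec(name="planner",  model="Qwen2.5-7B-Instruct", trainable=True,  lr=1e-6),
    AgentSpec(name="searcher", model="Qwen3-8B",            trainable=True,  lr=1e-6),
    AgentSpec(name="judge",    model="Qwen2.5-7B-Instruct", trainable=False, lr=0.0),
]

# Agents that reference the same backbone can share one actor backend
# (resource pooling), while each agent keeps its own reward statistics
# for agent-wise advantage normalization.
shared_backends = {}
for spec in agents:
    shared_backends.setdefault(spec.model, []).append(spec.name)
print(shared_backends)
```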
In this work, we theoretically pinpoint a key reason for training instability when extending group-based RL to multi-agent LLM systems: under GRPO-style optimization, a global normalization baseline may lead to gradient-norm instability. Based on this finding, Dr. MAS proposes a simple yet effective remedy that calibrates gradient scales and dramatically stabilizes training.
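To see why a single global baseline can destabilize gradients, the following toy numpy experiment compares pooled versus per-agent normalization when two agents have very different reward spreads; the numbers are synthetic and only meant to illustrate the scale mismatch described above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two agents with very different reward spreads inside one rollout group.
r_a = rng.normal(loc=0.5, scale=0.05, size=256)  # narrow reward distribution
r_b = rng.normal(loc=0.0, scale=2.0,  size=256)  # wide reward distribution
pooled = np.concatenate([r_a, r_b])

# Global normalization: one baseline and one scale for everything.
g_a = (r_a - pooled.mean()) / pooled.std()
g_b = (r_b - pooled.mean()) / pooled.std()

# Per-agent normalization: each agent uses its own statistics.
p_a = (r_a - r_a.mean()) / r_a.std()
p_b = (r_b - r_b.mean()) / r_b.std()

# |advantage| is a proxy for per-token gradient magnitude in policy-gradient
# updates: under global normalization, agent A is nearly silent while agent B
# dominates; under per-agent normalization, both sit near unit scale.
print("global   :", np.abs(g_a).mean(), np.abs(g_b).mean())
print("per-agent:", np.abs(p_a).mean(), np.abs(p_b).mean())
```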