---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
---

MATPO: Multi-Agent Tool-Integrated Policy Optimization

Train Multiple Agent Roles Within a Single LLM via Reinforcement Learning.

This model is presented in the paper Multi-Agent Tool-Integrated Policy Optimization (arXiv:2510.04678).



MATPO allows planner and worker agents to coexist within a single LLM and be trained jointly via RL, achieving an average relative improvement of 18.38% over single-agent baselines on GAIA-text, FRAMES, and WebWalkerQA.

News & Updates

  • [2025-Oct-08] MATPO-Qwen3-14B checkpoints and rollouts released
  • [2025-Oct-08] Code and training scripts released
  • [2025-Oct-06] arXiv paper released

Overview

MATPO (Multi-Agent Tool-Integrated Policy Optimization) is a novel reinforcement learning framework that enables training multiple specialized agent roles (planner and worker agents) within a single large language model.

The Problem

Current single-agent approaches for multi-turn tool-integrated planning face critical limitations:

  • Context Length Bottleneck: Tool responses (e.g., web scraping) consume excessive tokens, making long-horizon planning prohibitively expensive
  • Noisy Tool Responses: Raw tool responses interfere with the model's attention and planning capabilities

Our Solution

MATPO introduces a multi-agent-in-one-model architecture where:

  • A planner-agent orchestrates high-level planning and delegates subtasks
  • Worker-agents handle specific browsing and search tasks with isolated contexts
  • Both roles are trained within a single LLM using role-specific prompts via reinforcement learning

Key Features

  • Multi-Agent-in-One-Model: Train planner and worker agents within a single LLM using role-specific system prompts
  • Principled Credit Assignment: Extends GRPO with theoretically grounded reward distribution across planner and worker rollouts
  • Easy Integration: Built on top of veRL, compatible with existing RL training frameworks
  • Robust Training: More stable learning curves compared to single-agent approaches, especially with noisy tool responses
  • Infrastructure Efficient: No separate model deployments or additional rollout engines required

MATPO Architecture

MATPO employs a hierarchical multi-agent framework where a single LLM serves multiple roles:

User Query β†’ Planner Agent β†’ Subtask 1 β†’ Worker Agent β†’ Result 1
                           β†’ Subtask 2 β†’ Worker Agent β†’ Result 2
                           β†’ ...
                           β†’ Final Answer


Comparison of rollout trajectories between single-agent GRPO (top) and multi-agent MATPO (bottom).

Multi-Agent Rollout Process

  1. Planner Agent:

    • Receives user query with planner-specific system prompt
    • Generates high-level plan and decomposes it into subtasks
    • Delegates subtasks to worker agents
    • Synthesizes worker responses into final answer
  2. Worker Agent:

    • Receives subtask with worker-specific system prompt
    • Performs multi-turn tool-integrated planning (search, scrape, analyze)
    • Returns summarized result to planner
    • Maintains isolated context to prevent token overflow
  3. Credit Assignment:

    • Final answer accuracy determines the reward
    • Reward is normalized across all planner-worker rollout groups
    • Gradient flows to both planner actions and worker actions proportionally

Figure: Visualization of the MATPO implementation.
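The rollout steps above can be pictured with a short, self-contained sketch. This is illustrative pseudocode under assumed interfaces, not the repository's actual API: the prompts, the `llm` callable, and the `TOOL:`/`SUBTASK:` markers are hypothetical stand-ins for the role-specific prompting and delegation mechanism described in the paper.

```python
# Minimal sketch of a MATPO-style multi-agent rollout. Every name here
# (prompts, LLM interface, TOOL:/SUBTASK: markers) is a hypothetical
# stand-in, not the repository's actual implementation.
from typing import Callable

Message = dict[str, str]
LLM = Callable[[list[Message]], str]  # one model serves both roles

PLANNER_PROMPT = "You are a planner. Decompose the query and delegate subtasks."
WORKER_PROMPT = "You are a worker. Solve the subtask with search/scrape tools."

def run_worker(llm: LLM, subtask: str, call_tool: Callable[[str], str]) -> str:
    # Each worker starts from a fresh, isolated context: raw tool output
    # stays here and never enters the planner's token budget.
    messages: list[Message] = [
        {"role": "system", "content": WORKER_PROMPT},
        {"role": "user", "content": subtask},
    ]
    while True:
        reply = llm(messages)
        messages.append({"role": "assistant", "content": reply})
        if reply.startswith("TOOL:"):          # hypothetical tool-call marker
            messages.append({"role": "tool", "content": call_tool(reply[5:])})
        else:
            return reply                       # summarized result for the planner

def run_planner(llm: LLM, query: str, call_tool: Callable[[str], str]) -> str:
    # Planner role: same LLM, different system prompt.
    messages: list[Message] = [
        {"role": "system", "content": PLANNER_PROMPT},
        {"role": "user", "content": query},
    ]
    while True:
        reply = llm(messages)
        messages.append({"role": "assistant", "content": reply})
        if reply.startswith("SUBTASK:"):       # hypothetical delegation marker
            summary = run_worker(llm, reply[8:], call_tool)
            messages.append({"role": "user", "content": summary})
        else:
            return reply                       # final answer
```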

Quick Start

Prerequisites:

  • Python 3.10 or higher
  • CUDA 12.4+ (for GPU support)
  • 16 nodes with 8 × 80 GB A800 GPUs each (for training with Qwen3-14B-base)

Clone the repository.

git clone https://github.com/mzf666/MATPO.git
cd MATPO

For prerequisites installation (CUDA, cuDNN, Apex), we recommend following the verl prerequisites guide which provides detailed instructions for:

  • CUDA: Version >= 12.4
  • cuDNN: Version >= 9.8.0
  • Apex

Set up the environment and install dependencies.

conda create -n matpo python==3.10 -y
conda activate matpo
bash examples/sglang_multiturn/install.sh

Set up Node.js for Serper API support.

MCP (Model Context Protocol) requires Node.js to run MCP servers. Node.js version 18+ is recommended for optimal compatibility with MCP tools.

target_path=YOUR_TARGET_PATH

# Download Node.js binary (example for Linux x64)
wget https://nodejs.org/dist/v24.2.0/node-v24.2.0-linux-x64.tar.xz

# Extract to your target path
tar -xf node-v24.2.0-linux-x64.tar.xz -C $target_path

# Add to PATH
export NODEJS_HOME=$target_path/node-v24.2.0-linux-x64
export PATH=$NODEJS_HOME/bin:$PATH
export NODE_SHARED=$target_path/node-shared/node_modules
export PATH=$NODE_SHARED/.bin:$PATH

# Verify installation
node --version
npm --version

# Install serper mcp server
mkdir -p $target_path/node-shared
cd $target_path/node-shared
npm init -y
npm install serper-search-scrape-mcp-server

Configure the Node.js paths and HTTP/HTTPS proxies (if necessary) in the examples/sglang_multiturn/launch.sh script.

Download the training and testing datasets to the data directory. The preprocessed datasets can be downloaded here.

Train a Qwen3-14B-base model with MATPO on the MuSiQue dataset and evaluate on the GAIA-text dataset:

# tested on 16 nodes, each with 8 x 80GB A800 GPUs

export SERPER_API_KEY="YOUR_SERPER_API_KEY" && \
export OPENAI_API_KEY="YOUR_OPENAI_API_KEY" && \
export WANDB_API_KEY="YOUR_WANDB_API_KEY" && \
export SINGLENODE=true && \
export RAY_DEBUG=legacy && \
export HYDRA_FULL_ERROR=1 && \
source YOUR_CONDA_PATH activate matpo && \
cd YOUR_PROJECT_PATH && \
bash examples/sglang_multiturn/launch.sh \
    examples/sglang_multiturn/qwen3-14b_musique_MATPO.sh

Experiments and Results

Main Results

MATPO consistently outperforms single-agent GRPO baselines across all benchmarks:

| Method | GAIA-text | WebWalkerQA | FRAMES | Relative Average Improvement |
|---|---|---|---|---|
| Single-Agent GRPO | 32.16% | 30.14% | 56.22% | - |
| MATPO (Ours) | 42.60% | 33.00% | 63.64% | +18.38% |
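The +18.38% figure is the mean of the three per-benchmark relative gains, which can be checked directly from the table:

```python
# Average relative improvement of MATPO over single-agent GRPO,
# recomputed from the table above.
scores = {
    "GAIA-text":   (32.16, 42.60),
    "WebWalkerQA": (30.14, 33.00),
    "FRAMES":      (56.22, 63.64),
}
gains = [(matpo / grpo - 1) * 100 for grpo, matpo in scores.values()]
print([round(g, 2) for g in gains])       # [32.46, 9.49, 13.2]
print(round(sum(gains) / len(gains), 2))  # 18.38
```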

Training Configuration

  • Base Model: Qwen3-14B-base
  • Training Dataset: Filtered MuSiQue dataset
  • Training Steps: 180 steps
  • Rollouts per Query: 8 (for group normalization)
  • Reward Function: 0.9 × accuracy + 0.1 × tool_format_reward (see the sketch below)
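A minimal sketch of that reward combination, assuming both components are scores in [0, 1] (the actual scoring functions live in the training code):

```python
def reward(accuracy: float, tool_format_reward: float) -> float:
    """Weighted reward from the configuration above; inputs assumed in [0, 1]."""
    return 0.9 * accuracy + 0.1 * tool_format_reward

print(reward(1.0, 1.0))  # 1.0: correct answer with well-formed tool calls
print(reward(0.0, 1.0))  # 0.1: wrong answer, but tool-call format is correct
```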

Model Checkpoints and Rollouts

We release trained Qwen3-14B-base checkpoints at training step 180 for both single-agent GRPO and MATPO.

The associated model rollouts across various training steps can be found here.

Key Findings

  • More Stable Training: MATPO exhibits more stable learning curves and avoids catastrophic performance drops observed in single-agent training

  • Robustness to Noise: Multi-agent decomposition effectively isolates noisy tool responses, preventing them from interfering with high-level planning

  • Better Credit Assignment: Principled reward distribution across planner and worker rollouts leads to more effective learning

Practical Implementation Tips

Based on our experiments, we recommend:

  • Final Summary: Final summaries from worker agents are critical for a clean planner-worker interface
  • Query Recap: Recapping the original user query in the worker prompt significantly improves performance
  • URL Blocking: Remember to block HuggingFace search results to avoid data leakage

Citation

If you find MATPO helpful in your research, please consider citing our paper:

@misc{mo2025multiagenttoolintegratedpolicyoptimization,
      title={Multi-Agent Tool-Integrated Policy Optimization}, 
      author={Zhanfeng Mo and Xingxuan Li and Yuntao Chen and Lidong Bing},
      year={2025},
      eprint={2510.04678},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.04678}, 
}

Acknowledgments

We would like to thank:

  • VolcEngine for developing and open-sourcing veRL, the RL training framework that powers MATPO
  • Alibaba Cloud for the Qwen3 model series
  • Google for the Serper API that enables web search capabilities
  • The authors of GAIA, WebWalkerQA, FRAMES, and MuSiQue datasets
  • The open-source community for valuable feedback and contributions

FAQ

Q: What's the difference between MATPO and traditional multi-agent systems?

MATPO uses a single LLM to play multiple agent roles via different system prompts, rather than deploying separate models (see the sketch after this list). This offers:

  • Lower infrastructure complexity
  • Better parameter efficiency
  • Easier deployment and maintenance
  • Compatibility with existing RL frameworks
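Concretely, "one model, multiple roles" just means the same checkpoint is queried with different system prompts. A hypothetical sketch with transformers (the checkpoint path and prompts are illustrative placeholders):

```python
# Hypothetical sketch: one checkpoint serves both roles, selected purely
# by the system prompt. Path and prompts are illustrative placeholders.
from transformers import pipeline

generate = pipeline("text-generation", model="PATH_TO_MATPO_CHECKPOINT")

def ask(role_prompt: str, content: str) -> str:
    messages = [{"role": "system", "content": role_prompt},
                {"role": "user", "content": content}]
    # the pipeline returns the full chat; the last message is the reply
    return generate(messages, max_new_tokens=256)[0]["generated_text"][-1]["content"]

plan = ask("You are a planner...", "Find when the Eiffel Tower opened.")  # planner role
fact = ask("You are a worker...", "Search: Eiffel Tower opening date.")   # worker role
```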
Q: Can I use MATPO with models other than Qwen3?

Yes! MATPO is model-agnostic. You can use any decoder-only LLM that supports tool calling and multi-turn conversations. We've tested with Qwen3-14B-base, but models like Llama 3, Mistral, or other reasoning-capable LLMs should work.

Q: How many GPUs do I need for training?

For Qwen3-14B-base, we recommend:

  • Training: 8x A100/A800 GPUs (80GB)
  • Inference: 1-2x A100/A800 GPUs (40GB/80GB)
Q: How does MATPO handle credit assignment?

MATPO extends GRPO with principled credit assignment:

  1. The planner's final answer determines the accuracy reward
  2. This reward is normalized across all rollouts in a group
  3. Gradients flow proportionally to both planner and worker actions
  4. Worker agents receive the same advantage value as their parent planner rollout

See our paper for more details.
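For illustration only, here is a minimal sketch of that scheme: GRPO-style normalization of final-answer rewards within a rollout group, with each worker rollout inheriting its parent planner's advantage. The data layout is hypothetical.

```python
# Hypothetical sketch of MATPO-style credit assignment; see the paper
# for the exact formulation.
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style normalization of rewards within one rollout group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sigma for r in rewards]

# 8 planner rollouts per query (the group size from the training config)
final_rewards = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
planner_adv = group_advantages(final_rewards)

# Each worker rollout is credited with its parent planner's advantage,
# so the gradient reaches both roles of the single underlying LLM.
workers_per_rollout = [2, 1, 3, 2, 1, 2, 2, 1]  # hypothetical counts
worker_adv = [adv
              for adv, n in zip(planner_adv, workers_per_rollout)
              for _ in range(n)]
```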

Q: Can I use MATPO for tasks other than web search?

Absolutely! While our paper focuses on web search, MATPO's framework is general. You can extend it to:

  • Code generation with execution feedback
  • Scientific reasoning with calculator tools
  • Data analysis with pandas/SQL tools
  • Any multi-turn task with verifiable rewards
Q: How stable is MATPO training compared to single-agent RL?

MATPO is significantly more stable. Our experiments show:

  • Single-agent GRPO often suffers catastrophic drops after step 120
  • MATPO maintains steady improvement throughout training
  • Multi-agent structure isolates noisy tool responses, preventing interference

See Figure 4 in our paper for training curves.

Q: Do I need to block HuggingFace URLs during training?

For research integrity, yes - especially if your evaluation benchmarks are hosted on HuggingFace. This prevents models from "cheating" by finding ground-truth answers online.

For production systems with no data leakage concerns, this is optional.
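One simple way to implement this is a post-filter on search results before they reach the agent; a hypothetical sketch:

```python
# Hypothetical sketch: drop HuggingFace hits from search results so the
# model cannot retrieve benchmark ground truths during training.
from urllib.parse import urlparse

BLOCKED_SUFFIXES = ("huggingface.co", "hf.co")

def filter_results(results: list[dict]) -> list[dict]:
    """Keep only results whose host does not end with a blocked domain."""
    return [r for r in results
            if not urlparse(r["url"]).netloc.endswith(BLOCKED_SUFFIXES)]

clean = filter_results([
    {"url": "https://huggingface.co/datasets/gaia-benchmark/GAIA"},
    {"url": "https://example.com/article"},
])
print(clean)  # only the example.com result survives
```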


Star ⭐ this repository if you find it helpful!