---
license: apache-2.0
---
<div align="center">

# MATPO: Multi-Agent Tool-Integrated Policy Optimization

Train Multiple Agent Roles Within a Single LLM via Reinforcement Learning.

<!-- [![arXiv](https://img.shields.io/badge/arXiv-Coming_Soon.svg)](https://arxiv.org/pdf/2510.04678)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)
[![Python 3.10+](https://img.shields.io/badge/python-3.10-blue.svg)](https://www.python.org/downloads/)
[![Code](https://img.shields.io/badge/code-GitHub-black.svg)](https://github.com/mzf666/MATPO) -->

<!-- <hr> -->
<div align="center">

[![Models](https://img.shields.io/badge/Models-5EDDD2?style=for-the-badge&logo=huggingface&logoColor=ffffff&labelColor)](https://huggingface.co/veggiebird/MATPO-14b)
[![Data](https://img.shields.io/badge/Data-0040A1?style=for-the-badge&logo=huggingface&logoColor=ffffff&labelColor)](https://huggingface.co/datasets/veggiebird/MATPO-data)
[![Paper](https://img.shields.io/badge/Paper-000000?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2510.04678)
[![Github](https://img.shields.io/badge/Code-000000?style=for-the-badge&logo=github&logoColor=white)](https://github.com/mzf666/MATPO)
</div>

</div>

<div align="center">
<table>
<tr>
<td align="center">
<img src="assets/main_gaia.png" width="220px" alt="GAIA Results"><br>
<em>GAIA Results</em>
</td>
<td align="center">
<img src="assets/main_frameqa.png" width="220px" alt="FRAMES Results"><br>
<em>FRAMES Results</em>
</td>
<td align="center">
<img src="assets/main_webwalkerqa.png" width="220px" alt="WebWalkerQA Results"><br>
<em>WebWalkerQA Results</em>
</td>
</tr>
</table>
</div>

<p align="center">
<img src="assets/multi_agent_framework.png" width="500px" alt="MATPO Framework">
</p>

<p align="center">
<em>MATPO allows planner and worker agents to coexist within a single LLM and be trained via RL, achieving an 18.38% relative improvement over single-agent baselines on GAIA-text, FRAMES, and WebWalkerQA.</em>
</p>

## News & Updates

- **[2025-Oct-08]** MATPO-Qwen3-14B checkpoints and rollouts released
- **[2025-Oct-08]** Code and training scripts released
- **[2025-Oct-06]** arXiv paper released

## Overview

**MATPO** (Multi-Agent Tool-Integrated Policy Optimization) is a reinforcement learning framework that trains multiple specialized agent roles (planner and worker agents) within a single large language model.

### The Problem
Current single-agent approaches to multi-turn tool-integrated planning face critical limitations:
- **Context Length Bottleneck**: Tool responses (e.g., web scraping) consume excessive tokens, making long-range planning prohibitive
- **Noisy Tool Responses**: Raw tool responses interfere with the model's attention and planning capabilities

### Our Solution
MATPO introduces a **multi-agent-in-one-model** architecture where:
- A **planner-agent** orchestrates high-level planning and delegates subtasks
- **Worker-agents** handle specific browsing and search tasks with isolated contexts
- Both roles are trained within a **single LLM** using role-specific prompts via reinforcement learning (see the sketch below)

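To make the multi-agent-in-one-model idea concrete, here is a minimal sketch of driving one shared model with role-specific system prompts. The prompt wording, the `chat` helper, and the `model.generate` call are illustrative assumptions, not the exact prompts or API used in the MATPO codebase.

```python
# Minimal sketch: one LLM, two roles via system prompts (illustrative names).
PLANNER_SYSTEM_PROMPT = (
    "You are a planner. Decompose the user's question into subtasks, "
    "delegate each subtask to a worker, then synthesize a final answer."
)

WORKER_SYSTEM_PROMPT = (
    "You are a worker. Solve the given subtask using the available "
    "search/scrape tools and return a concise summary."
)

def chat(model, system_prompt: str, user_message: str) -> str:
    """Call the shared policy model with a role-specific system prompt."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]
    return model.generate(messages)  # same weights serve both roles

def run_planner(model, query: str) -> str:
    return chat(model, PLANNER_SYSTEM_PROMPT, query)

def run_worker(model, subtask: str) -> str:
    # The worker only sees its subtask, not the planner's full history,
    # so long tool outputs cannot overflow the planner's context.
    return chat(model, WORKER_SYSTEM_PROMPT, subtask)
```

Because both roles share one set of weights, a single RL update improves planning and tool use together, with no extra rollout engine to deploy.
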
## Key Features

- **Multi-Agent-in-One-Model**: Train planner and worker agents within a single LLM using role-specific system prompts
- **Principled Credit Assignment**: Extends GRPO with theoretically grounded reward distribution across planner and worker rollouts
- **Easy Integration**: Built on top of [veRL](https://github.com/volcengine/verl), compatible with existing RL training frameworks
- **Robust Training**: More stable learning curves than single-agent approaches, especially with noisy tool responses
- **Infrastructure Efficient**: No need to deploy separate models or additional rollout engines

## MATPO Architecture

MATPO employs a hierarchical multi-agent framework where a single LLM serves multiple roles:

```
User Query → Planner Agent → Subtask 1 → Worker Agent → Result 1
                           → Subtask 2 → Worker Agent → Result 2
                           → ...
                           → Final Answer
```

<p align="center">
<img src="assets/single_agent.png" width="600px" alt="Single-agent GRPO Framework">
<img src="assets/multi_agent_RL_rollout.png" width="600px" alt="MATPO Framework">
</p>

<p align="center">
<em>Comparison of the rollout trajectories of single-agent GRPO (top) and multi-agent MATPO (bottom).</em>
</p>

### Multi-Agent Rollout Process

1. **Planner Agent**:
   - Receives the user query with a planner-specific system prompt
   - Generates a high-level plan and decomposes it into subtasks
   - Delegates subtasks to worker agents
   - Synthesizes worker responses into the final answer

2. **Worker Agent**:
   - Receives a subtask with a worker-specific system prompt
   - Performs multi-turn tool-integrated planning (search, scrape, analyze)
   - Returns a summarized result to the planner
   - Maintains an isolated context to prevent token overflow

3. **Credit Assignment**:
   - Final-answer accuracy determines the reward
   - The reward is normalized across all planner-worker rollout groups
   - Gradients flow proportionally to both planner and worker actions

A simplified sketch of this rollout loop follows.

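The sketch below shows the control flow only, reusing the prompts from the earlier sketch; `parse_subtask`, `parse_tool_call`, and `execute_tool` are assumed helpers rather than actual MATPO/veRL interfaces.

```python
# Control-flow sketch of one MATPO rollout (illustrative names throughout).

def matpo_rollout(model, query: str, max_planner_turns: int = 8) -> str:
    history = [{"role": "system", "content": PLANNER_SYSTEM_PROMPT},
               {"role": "user", "content": query}]
    for _ in range(max_planner_turns):
        reply = model.generate(history)
        history.append({"role": "assistant", "content": reply})
        subtask = parse_subtask(reply)      # assumed parser for delegation calls
        if subtask is None:                 # no delegation -> final answer
            return reply
        # The worker runs its own multi-turn tool loop in a fresh context;
        # only its short summary is appended to the planner's history.
        summary = run_worker_tool_loop(model, subtask)
        history.append({"role": "user", "content": f"Worker result: {summary}"})
    return reply

def run_worker_tool_loop(model, subtask: str, max_turns: int = 10) -> str:
    history = [{"role": "system", "content": WORKER_SYSTEM_PROMPT},
               {"role": "user", "content": subtask}]
    for _ in range(max_turns):
        reply = model.generate(history)
        history.append({"role": "assistant", "content": reply})
        tool_call = parse_tool_call(reply)  # assumed parser for tool calls
        if tool_call is None:               # no tool call -> final summary
            return reply
        observation = execute_tool(tool_call)  # e.g., Serper search/scrape via MCP
        history.append({"role": "user", "content": f"Tool result: {observation}"})
    return reply
```
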
<p align="center">
<img src="assets/multi-agent-grpo-implementation.png" width="600px" alt="MATPO Framework">
</p>

<p align="center">
<em>Visualization of the MATPO implementation.</em>
</p>

## Quick Start

Prerequisites:
- Python 3.10 or higher
- CUDA 12.4+ (for GPU support)
- 16 nodes of 8 × 80GB A800 GPUs (for training Qwen3-14B-base)

Clone the repository:
```bash
git clone https://github.com/mzf666/MATPO.git
cd MATPO
```

For prerequisite installation (CUDA, cuDNN, Apex), we recommend following the [verl prerequisites guide](https://verl.readthedocs.io/en/latest/start/install.html#pre-requisites), which provides detailed instructions for:

- CUDA: version >= 12.4
- cuDNN: version >= 9.8.0
- Apex

Set up the environment and install dependencies:
```bash
conda create -n matpo python=3.10 -y
conda activate matpo
bash examples/sglang_multiturn/install.sh
```

Set up Node.js for Serper API support.

MCP (Model Context Protocol) requires Node.js to run MCP servers. Node.js version 18+ is recommended for optimal compatibility with MCP tools.
```bash
target_path=YOUR_TARGET_PATH

# Download the Node.js binary (example for Linux x64)
wget https://nodejs.org/dist/v24.2.0/node-v24.2.0-linux-x64.tar.xz

# Extract to your target path
tar -xf node-v24.2.0-linux-x64.tar.xz -C $target_path

# Add to PATH
export NODEJS_HOME=$target_path/node-v24.2.0-linux-x64
export PATH=$NODEJS_HOME/bin:$PATH
export NODE_SHARED=$target_path/node-shared/node_modules
export PATH=$NODE_SHARED/.bin:$PATH

# Verify installation
node --version
npm --version

# Install the Serper MCP server
mkdir -p $target_path/node-shared
cd $target_path/node-shared
npm init -y
npm install serper-search-scrape-mcp-server
```

If necessary, configure the Node.js paths and HTTP/HTTPS proxies in the `examples/sglang_multiturn/launch.sh` script.

Download the training and testing datasets to the `data` directory. The preprocessed datasets can be downloaded [here](https://huggingface.co/datasets/veggiebird/MATPO-data), for example with the snippet below.

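One way to fetch the preprocessed datasets programmatically is via `huggingface_hub`; the `local_dir="data"` layout is an assumption about where the training scripts expect the files.

```python
# Sketch: download the preprocessed MATPO datasets with huggingface_hub.
# Assumes the training scripts read from a local `data/` directory.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="veggiebird/MATPO-data",
    repo_type="dataset",
    local_dir="data",
)
```
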
Train a Qwen3-14B-base model with MATPO on the MuSiQue dataset and evaluate on the GAIA-text dataset:

```bash
# tested on 16 x (8 x 80G-A800) nodes

export SERPER_API_KEY="YOUR_SERPER_API_KEY" && \
export OPENAI_API_KEY="YOUR_OPENAI_API_KEY" && \
export WANDB_API_KEY="YOUR_WANDB_API_KEY" && \
export SINGLENODE=true && \
export RAY_DEBUG=legacy && \
export HYDRA_FULL_ERROR=1 && \
source YOUR_CONDA_PATH activate matpo && \
cd YOUR_PROJECT_PATH && \
bash examples/sglang_multiturn/launch.sh \
    examples/sglang_multiturn/qwen3-14b_musique_MATPO.sh
```

## Experiments and Results

### Main Results

MATPO consistently outperforms single-agent GRPO baselines across all benchmarks:

| Method | GAIA-text | WebWalkerQA | FRAMES | Relative Average Improvement |
|--------|-----------|-------------|--------|------------------------------|
| Single-Agent GRPO | 32.16% | 30.14% | 56.22% | - |
| **MATPO (Ours)** | **42.60%** | **33.00%** | **63.64%** | **+18.38%** |

### Training Configuration

- **Base Model**: Qwen3-14B-base
- **Training Dataset**: Filtered MuSiQue dataset
- **Training Steps**: 180
- **Rollouts per Query**: 8 (for group normalization)
- **Reward Function**: 0.9 × accuracy + 0.1 × tool_format_reward (a minimal sketch follows)

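As a sketch of that composite reward, the function below weights a binary answer-accuracy check against a tool-call formatting score; `answer_is_correct` and `tool_calls_well_formed` are assumed helpers, not the repository's actual reward implementation.

```python
# Sketch of the composite reward from the configuration above:
# 0.9 * accuracy + 0.1 * tool_format_reward.
# `answer_is_correct` and `tool_calls_well_formed` are assumed helpers.

def compute_reward(final_answer: str, ground_truth: str, rollout_text: str) -> float:
    accuracy = 1.0 if answer_is_correct(final_answer, ground_truth) else 0.0
    tool_format_reward = 1.0 if tool_calls_well_formed(rollout_text) else 0.0
    return 0.9 * accuracy + 0.1 * tool_format_reward
```
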
### Model Checkpoints and Rollouts

We release the trained Qwen3-14B-base model checkpoints at the 180th training step for both [single-agent GRPO](https://huggingface.co/veggiebird/MATPO-single-agent-14b) and [MATPO](https://huggingface.co/veggiebird/MATPO-14b).

The associated model rollouts across training steps can be found [here](https://huggingface.co/datasets/veggiebird/MATPO-rollout).

### Key Findings

- **More Stable Training**: MATPO exhibits more stable learning curves and avoids catastrophic performance drops observed in single-agent training

- **Robustness to Noise**: Multi-agent decomposition effectively isolates noisy tool responses, preventing them from interfering with high-level planning

- **Better Credit Assignment**: Principled reward distribution across planner and worker rollouts leads to more effective learning

### Practical Implementation Tips

Based on our experiments, we recommend:

- **Final Summary**: Final summaries from worker agents are critical for a clean planner-worker interface
- **Query Recap**: Recapping the original user query in the worker prompt significantly improves performance
- **URL Blocking**: Remember to block HuggingFace URLs in search results to avoid data leakage (see the sketch below)

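As an illustration of the URL-blocking tip, a search-result filter might look like the sketch below; the result format and the blocklist are assumptions, not the filtering logic actually used in MATPO.

```python
# Sketch of the URL-blocking tip: drop search results whose URLs could
# leak benchmark ground truth. The result dict format and the blocklist
# are illustrative assumptions.

BLOCKED_DOMAINS = ("huggingface.co",)

def filter_search_results(results: list[dict]) -> list[dict]:
    """Remove results pointing at blocked domains before the agent sees them."""
    return [
        r for r in results
        if not any(domain in r.get("link", "") for domain in BLOCKED_DOMAINS)
    ]

# Example:
# filter_search_results([{"link": "https://huggingface.co/datasets/x"},
#                        {"link": "https://example.com/article"}])
# -> [{"link": "https://example.com/article"}]
```
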
## Citation

If you find MATPO helpful in your research, please consider citing our paper:

```bibtex
@misc{mo2025multiagenttoolintegratedpolicyoptimization,
      title={Multi-Agent Tool-Integrated Policy Optimization},
      author={Zhanfeng Mo and Xingxuan Li and Yuntao Chen and Lidong Bing},
      year={2025},
      eprint={2510.04678},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.04678},
}
```

## Acknowledgments

We would like to thank:

- **VolcEngine** for developing and open-sourcing [veRL](https://github.com/volcengine/verl), the RL training framework that powers MATPO
- **Alibaba Cloud** for the Qwen3 model series
- **Serper** for the search API that enables web search capabilities
- The authors of the **GAIA**, **WebWalkerQA**, **FRAMES**, and **MuSiQue** datasets
- The open-source community for valuable feedback and contributions

## FAQ

<details>
<summary><b>Q: What's the difference between MATPO and traditional multi-agent systems?</b></summary>

MATPO uses a single LLM to play multiple agent roles via different system prompts, rather than deploying separate models. This offers:
- Lower infrastructure complexity
- Better parameter efficiency
- Easier deployment and maintenance
- Compatibility with existing RL frameworks
</details>

<details>
<summary><b>Q: Can I use MATPO with models other than Qwen3?</b></summary>

Yes! MATPO is model-agnostic. You can use any decoder-only LLM that supports tool calling and multi-turn conversations. We've tested with Qwen3-14B-base, but models like Llama 3, Mistral, or other reasoning-capable LLMs should work.
</details>

<details>
<summary><b>Q: How many GPUs do I need for training?</b></summary>

For Qwen3-14B-base, we recommend:
- **Training**: 8x A100/A800 GPUs (80GB)
- **Inference**: 1-2x A100/A800 GPUs (40GB/80GB)

</details>

<details>
<summary><b>Q: How does MATPO handle credit assignment?</b></summary>

MATPO extends GRPO with principled credit assignment:
1. The planner's final answer determines the accuracy reward
2. This reward is normalized across all rollouts in a group
3. Gradients flow proportionally to both planner and worker actions
4. Worker agents receive the same advantage value as their parent planner rollout

See our paper for more details; a toy numerical sketch of the advantage sharing follows.
</details>

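To make point 4 concrete, here is a toy sketch of group-normalized advantages being shared between a planner rollout and its workers. The rewards are made up; the normalization mirrors standard GRPO (reward minus group mean, divided by group standard deviation).

```python
# Toy sketch of MATPO-style advantage sharing (made-up rewards).
# Each of the 8 rollouts gets a GRPO group-normalized advantage, and the
# worker trajectories spawned by a rollout reuse their parent's advantage.
import statistics

group_rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]  # 8 rollouts per query

mean = statistics.mean(group_rewards)
std = statistics.pstdev(group_rewards) + 1e-6  # epsilon avoids division by zero

advantages = [(r - mean) / std for r in group_rewards]

# Rollout 0 scored 1.0, so its planner tokens AND the tokens of every
# worker it spawned are trained with the same positive advantage:
parent_advantage = advantages[0]
worker_advantages = [parent_advantage] * 3  # e.g., a rollout with 3 worker calls
print(f"planner advantage: {parent_advantage:.3f}, shared by its workers")
```
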
<details>
<summary><b>Q: Can I use MATPO for tasks other than web search?</b></summary>

Absolutely! While our paper focuses on web search, MATPO's framework is general. You can extend it to:
- Code generation with execution feedback
- Scientific reasoning with calculator tools
- Data analysis with pandas/SQL tools
- Any multi-turn task with verifiable rewards
</details>

<details>
<summary><b>Q: How stable is MATPO training compared to single-agent RL?</b></summary>

MATPO is significantly more stable. Our experiments show:
- Single-agent GRPO often suffers catastrophic drops after step 120
- MATPO maintains steady improvement throughout training
- Multi-agent structure isolates noisy tool responses, preventing interference

See Figure 4 in our paper for training curves.
</details>

<details>
<summary><b>Q: Do I need to block HuggingFace URLs during training?</b></summary>

For research integrity, yes, especially if your evaluation benchmarks are hosted on HuggingFace. This prevents models from "cheating" by finding ground-truth answers online.

For production systems with no data leakage concerns, this is optional.
</details>

-----

<p align="center">
<strong>Star ⭐ this repository if you find it helpful!</strong>
</p>