Improve model card: Add pipeline tag, library name, HF paper link, and sample usage
#1
by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,6 +1,9 @@
 1 | ---
 2 | license: apache-2.0
 3 | ---
 4 | <div align="center">
 5 |
 6 | # MATPO: Multi-Agent Tool-Integrated Policy Optimization
@@ -17,7 +20,8 @@ Train Multiple Agent Roles Within a Single LLM via Reinforcement Learning.
 17 |
 18 | [](https://huggingface.co/veggiebird/MATPO-14b)
 19 | [](https://huggingface.co/datasets/veggiebird/MATPO-data)
 20 | - [](https://arxiv.org/abs/2510.04678)
 21 | [](https://github.com/mzf666/MATPO)
 22 | </div>
 23 |
@@ -213,6 +217,77 @@ bash examples/sglang_multiturn/launch.sh \
 213 | examples/sglang_multiturn/qwen3-14b_musique_MATPO.sh
 214 | ```
 215 |
 216 | ## Experiments and Results
 217 |
 218 | ### Main Results
@@ -318,7 +393,7 @@ For Qwen3-14B-base, we recommend:
 318 | MATPO extends GRPO with principled credit assignment:
 319 | 1. The planner's final answer determines the accuracy reward
 320 | 2. This reward is normalized across all rollouts in a group
 321 | - 3. Gradients flow proportionally to both planner and worker actions
 322 | 4. Worker agents receive the same advantage value as their parent planner rollout
 323 |
 324 | See our paper for more details.
 1 | ---
 2 | license: apache-2.0
 3 | + pipeline_tag: text-generation
 4 | + library_name: transformers
 5 | ---
 6 | +
 7 | <div align="center">
 8 |
 9 | # MATPO: Multi-Agent Tool-Integrated Policy Optimization
 20 |
 21 | [](https://huggingface.co/veggiebird/MATPO-14b)
 22 | [](https://huggingface.co/datasets/veggiebird/MATPO-data)
 23 | + [](https://arxiv.org/abs/2510.04678)
 24 | + [](https://huggingface.co/papers/2510.04678)
 25 | [](https://github.com/mzf666/MATPO)
 26 | </div>
 27 |
 217 | examples/sglang_multiturn/qwen3-14b_musique_MATPO.sh
 218 | ```
 219 |
 220 | + ## Sample Usage
 221 | +
 222 | + You can play with the model yourself using the following Python code snippet for inference:
 223 | +
 224 | + ```python
 225 | + import matpo
 226 | + from matpo.models import LlavaAgent, ClaudeAgent
 227 | + from accelerate import Accelerator
 228 | + import torch
 229 | + from tqdm import tqdm
 230 | + from types import SimpleNamespace
 231 | + from matpo.environment.webgym import BatchedWebEnv
 232 | + import os
 233 | + from transformers import AutoModelForCausalLM, AutoTokenizer
 234 | +
 235 | + # ============= Instantiate the agent =============
 236 | + config_dict = {"use_lora": False,
 237 | +                "use_q4": False,  # our 34B model is quantized to 4-bit; set this to True if you are using the 34B model
 238 | +                "use_anyres": False,
 239 | +                "temperature": 1.0,
 240 | +                "max_new_tokens": 512,
 241 | +                "train_vision": False,
 242 | +                "num_beams": 1}
 243 | + config = SimpleNamespace(**config_dict)
 244 | +
 245 | + accelerator = Accelerator()
 246 | + agent = LlavaAgent(policy_lm="veggiebird/MATPO-14b",  # specify your model here
 247 | +                    device=accelerator.device,
 248 | +                    accelerator=accelerator,
 249 | +                    config=config)
 250 | +
 251 | + # ============= Instantiate the environment (example, adjust as needed) =============
 252 | + test_tasks = [{"web_name": "Google Map",
 253 | +                "id": "0",
 254 | +                "ques": "Locate a parking lot near the Brooklyn Bridge that is open 24 hours. Review the user comments about it.",
 255 | +                "web": "https://www.google.com/maps/"}]
 256 | + save_path = "path/to/save/results"  # Change this
 257 | +
 258 | + test_env = BatchedWebEnv(tasks=test_tasks,
 259 | +                          do_eval=False,
 260 | +                          download_dir=os.path.join(save_path, 'test_driver', 'download'),
 261 | +                          output_dir=os.path.join(save_path, 'test_driver', 'output'),
 262 | +                          batch_size=1,
 263 | +                          max_iter=10)
 264 | +
 265 | + image_histories = []  # stores the history of the paths of images
 266 | + action_histories = []  # stores the history of actions
 267 | +
 268 | + results = test_env.reset()
 269 | + image_histories.append(results[0][0]["image"])
 270 | +
 271 | + observations = [r[0] for r in results]
 272 | + actions = agent.get_action(observations)
 273 | + action_histories.append(actions[0])
 274 | + dones = None
 275 | +
 276 | + for _ in tqdm(range(3)):  # iterate for a few steps
 277 | +     if dones is not None and all(dones):
 278 | +         break
 279 | +     results = test_env.step(actions)
 280 | +     image_histories.append(results[0][0]["image"])
 281 | +     observations = [r[0] for r in results]
 282 | +     actions = agent.get_action(observations)
 283 | +     action_histories.append(actions[0])
 284 | +     dones = [r[2] for r in results]
 285 | +
 286 | + print("Done!")
 287 | + print("image_histories: ", image_histories)
 288 | + print("action_histories: ", action_histories)
 289 | + ```
 290 | +
 291 | ## Experiments and Results
 292 |
 293 | ### Main Results
 393 | MATPO extends GRPO with principled credit assignment:
 394 | 1. The planner's final answer determines the accuracy reward
 395 | 2. This reward is normalized across all rollouts in a group
 396 | + 3. Gradients flow proportionally to both planner actions and worker actions
 397 | 4. Worker agents receive the same advantage value as their parent planner rollout
 398 |
 399 | See our paper for more details.
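The credit-assignment steps listed above can be sketched in a few lines. This is an illustrative, hypothetical snippet, not code from the MATPO repository: the function and field names (`assign_advantages`, `planner_id`, `worker_ids`) are invented for the example, and the normalization follows the plain GRPO group-mean/std recipe, which may differ in detail from the paper.

```python
import statistics

def grpo_advantages(rewards):
    """GRPO-style group normalization: (r - mean) / std over one rollout group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

def assign_advantages(group):
    """Each rollout carries the planner's accuracy reward and the ids of its
    worker sub-rollouts; every worker inherits its parent planner's advantage."""
    advantages = grpo_advantages([rollout["reward"] for rollout in group])
    out = {}
    for rollout, adv in zip(group, advantages):
        out[rollout["planner_id"]] = adv
        for worker_id in rollout["worker_ids"]:
            out[worker_id] = adv  # step 4: workers share the parent's advantage
    return out

# A group of two planner rollouts, each spawning worker sub-rollouts.
group = [
    {"planner_id": "p0", "reward": 1.0, "worker_ids": ["p0/w0", "p0/w1"]},
    {"planner_id": "p1", "reward": 0.0, "worker_ids": ["p1/w0"]},
]
advs = assign_advantages(group)
print(advs)
```

Because the worker trajectories reuse the parent rollout's normalized advantage rather than receiving their own reward, gradients on worker actions are scaled by the same signal that the planner's final answer earned.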