Improve model card: Add pipeline tag, library name, HF paper link, and sample usage
#1
by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,6 +1,9 @@
 1 | ---
 2 | license: apache-2.0
 3 | ---
 4 | <div align="center">
 5 |
 6 | # MATPO: Multi-Agent Tool-Integrated Policy Optimization
@@ -17,7 +20,8 @@ Train Multiple Agent Roles Within a Single LLM via Reinforcement Learning.
 17 |
 18 | [](https://huggingface.co/veggiebird/MATPO-14b)
 19 | [](https://huggingface.co/datasets/veggiebird/MATPO-data)
 20 | - [](https://arxiv.org/abs/2510.04678)
 21 | [](https://github.com/mzf666/MATPO)
 22 | </div>
 23 |
@@ -213,6 +217,77 @@ bash examples/sglang_multiturn/launch.sh \
 213 | examples/sglang_multiturn/qwen3-14b_musique_MATPO.sh
 214 | ```
 215 |
 216 | ## Experiments and Results
 217 |
 218 | ### Main Results
@@ -318,7 +393,7 @@ For Qwen3-14B-base, we recommend:
 318 | MATPO extends GRPO with principled credit assignment:
 319 | 1. The planner's final answer determines the accuracy reward
 320 | 2. This reward is normalized across all rollouts in a group
 321 | - 3. Gradients flow proportionally to both planner and worker actions
 322 | 4. Worker agents receive the same advantage value as their parent planner rollout
 323 |
 324 | See our paper for more details.
 1 | ---
 2 | license: apache-2.0
 3 | + pipeline_tag: text-generation
 4 | + library_name: transformers
 5 | ---
 6 | +
 7 | <div align="center">
 8 |
 9 | # MATPO: Multi-Agent Tool-Integrated Policy Optimization
 20 |
 21 | [](https://huggingface.co/veggiebird/MATPO-14b)
 22 | [](https://huggingface.co/datasets/veggiebird/MATPO-data)
 23 | + [](https://arxiv.org/abs/2510.04678)
 24 | + [](https://huggingface.co/papers/2510.04678)
 25 | [](https://github.com/mzf666/MATPO)
 26 | </div>
 27 |
 217 | examples/sglang_multiturn/qwen3-14b_musique_MATPO.sh
 218 | ```
 219 |
 220 | + ## Sample Usage
 221 | +
 222 | + You can play with the model yourself using the following Python code snippet for inference:
 223 | +
 224 | + ```python
 225 | + import matpo
 226 | + from matpo.models import LlavaAgent, ClaudeAgent
 227 | + from accelerate import Accelerator
 228 | + import torch
 229 | + from tqdm import tqdm
 230 | + from types import SimpleNamespace
 231 | + from matpo.environment.webgym import BatchedWebEnv
 232 | + import os
 233 | + from transformers import AutoModelForCausalLM, AutoTokenizer
 234 | +
 235 | + # ============= Instantiate the agent =============
 236 | + config_dict = {"use_lora": False,
 237 | +                "use_q4": False,  # our 34B model is quantized to 4-bit; set this to True if you are using the 34B model
 238 | +                "use_anyres": False,
 239 | +                "temperature": 1.0,
 240 | +                "max_new_tokens": 512,
 241 | +                "train_vision": False,
 242 | +                "num_beams": 1}
 243 | + config = SimpleNamespace(**config_dict)
 244 | +
 245 | + accelerator = Accelerator()
 246 | + agent = LlavaAgent(policy_lm="veggiebird/MATPO-14b",  # specify your model here
 247 | +                    device=accelerator.device,
 248 | +                    accelerator=accelerator,
 249 | +                    config=config)
 250 | +
 251 | + # ============= Instantiate the environment (example, adjust as needed) =============
 252 | + test_tasks = [{"web_name": "Google Map",
 253 | +                "id": "0",
 254 | +                "ques": "Locate a parking lot near the Brooklyn Bridge that is open 24 hours. Review the user comments about it.",
 255 | +                "web": "https://www.google.com/maps/"}]
 256 | + save_path = "path/to/save/results"  # Change this
 257 | +
 258 | + test_env = BatchedWebEnv(tasks=test_tasks,
 259 | +                          do_eval=False,
 260 | +                          download_dir=os.path.join(save_path, 'test_driver', 'download'),
 261 | +                          output_dir=os.path.join(save_path, 'test_driver', 'output'),
 262 | +                          batch_size=1,
 263 | +                          max_iter=10)
 264 | +
 265 | + image_histories = []  # stores the history of the paths of images
 266 | + action_histories = []  # stores the history of actions
 267 | +
 268 | + results = test_env.reset()
 269 | + image_histories.append(results[0][0]["image"])
 270 | +
 271 | + observations = [r[0] for r in results]
 272 | + actions = agent.get_action(observations)
 273 | + action_histories.append(actions[0])
 274 | + dones = None
 275 | +
 276 | + for _ in tqdm(range(3)):  # iterate for a few steps
 277 | +     if dones is not None and all(dones):
 278 | +         break
 279 | +     results = test_env.step(actions)
 280 | +     image_histories.append(results[0][0]["image"])
 281 | +     observations = [r[0] for r in results]
 282 | +     actions = agent.get_action(observations)
 283 | +     action_histories.append(actions[0])
 284 | +     dones = [r[2] for r in results]
 285 | +
 286 | + print("Done!")
 287 | + print("image_histories: ", image_histories)
 288 | + print("action_histories: ", action_histories)
 289 | + ```
 290 | +
 291 | ## Experiments and Results
 292 |
 293 | ### Main Results
 393 | MATPO extends GRPO with principled credit assignment:
 394 | 1. The planner's final answer determines the accuracy reward
 395 | 2. This reward is normalized across all rollouts in a group
 396 | + 3. Gradients flow proportionally to both planner actions and worker actions
 397 | 4. Worker agents receive the same advantage value as their parent planner rollout
 398 |
 399 | See our paper for more details.
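The credit-assignment steps listed above can be sketched in a few lines. This is an illustrative, hypothetical snippet, not code from the MATPO repository: the function and field names (`assign_advantages`, `planner_id`, `worker_ids`) are invented for the example, and the normalization follows the plain GRPO group-mean/std recipe, which may differ in detail from the paper.

```python
import statistics

def grpo_advantages(rewards):
    """GRPO-style group normalization: (r - mean) / std over one rollout group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

def assign_advantages(group):
    """Each rollout carries the planner's accuracy reward and the ids of its
    worker sub-rollouts; every worker inherits its parent planner's advantage."""
    advantages = grpo_advantages([rollout["reward"] for rollout in group])
    out = {}
    for rollout, adv in zip(group, advantages):
        out[rollout["planner_id"]] = adv
        for worker_id in rollout["worker_ids"]:
            out[worker_id] = adv  # step 4: workers share the parent's advantage
    return out

# A group of two planner rollouts, each spawning worker sub-rollouts.
group = [
    {"planner_id": "p0", "reward": 1.0, "worker_ids": ["p0/w0", "p0/w1"]},
    {"planner_id": "p1", "reward": 0.0, "worker_ids": ["p1/w0"]},
]
advs = assign_advantages(group)
print(advs)
```

Because the worker trajectories reuse the parent rollout's normalized advantage rather than receiving their own reward, gradients on worker actions are scaled by the same signal that the planner's final answer earned.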