Improve model card: Add pipeline tag, library name, HF paper link, and sample usage

#1 opened by nielsr (HF Staff)
Files changed (1)
  1. README.md +77 -2
README.md CHANGED
@@ -1,6 +1,9 @@
  ---
  license: apache-2.0
  ---

  <div align="center">

  # MATPO: Multi-Agent Tool-Integrated Policy Optimization
@@ -17,7 +20,8 @@ Train Multiple Agent Roles Within a Single LLM via Reinforcement Learning.

  [![Models](https://img.shields.io/badge/Models-5EDDD2?style=for-the-badge&logo=huggingface&logoColor=ffffff&labelColor)](https://huggingface.co/veggiebird/MATPO-14b)
  [![Data](https://img.shields.io/badge/Data-0040A1?style=for-the-badge&logo=huggingface&logoColor=ffffff&labelColor)](https://huggingface.co/datasets/veggiebird/MATPO-data)
- [![Paper](https://img.shields.io/badge/Paper-000000?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2510.04678)
  [![Github](https://img.shields.io/badge/Code-000000?style=for-the-badge&logo=github&logoColor=white)](https://github.com/mzf666/MATPO)
  </div>
@@ -213,6 +217,77 @@ bash examples/sglang_multiturn/launch.sh \
  examples/sglang_multiturn/qwen3-14b_musique_MATPO.sh
  ```
  ## Experiments and Results

  ### Main Results
@@ -318,7 +393,7 @@ For Qwen3-14B-base, we recommend:
  MATPO extends GRPO with principled credit assignment:
  1. The planner's final answer determines the accuracy reward
  2. This reward is normalized across all rollouts in a group
- 3. Gradients flow proportionally to both planner and worker actions
  4. Worker agents receive the same advantage value as their parent planner rollout

  See our paper for more details.
 
  ---
  license: apache-2.0
+ pipeline_tag: text-generation
+ library_name: transformers
  ---
+
  <div align="center">

  # MATPO: Multi-Agent Tool-Integrated Policy Optimization
 

  [![Models](https://img.shields.io/badge/Models-5EDDD2?style=for-the-badge&logo=huggingface&logoColor=ffffff&labelColor)](https://huggingface.co/veggiebird/MATPO-14b)
  [![Data](https://img.shields.io/badge/Data-0040A1?style=for-the-badge&logo=huggingface&logoColor=ffffff&labelColor)](https://huggingface.co/datasets/veggiebird/MATPO-data)
+ [![Paper (arXiv)](https://img.shields.io/badge/Paper-000000?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2510.04678)
+ [![Paper (Hugging Face)](https://img.shields.io/badge/Paper-000000?style=for-the-badge&logo=huggingface&logoColor=white)](https://huggingface.co/papers/2510.04678)
  [![Github](https://img.shields.io/badge/Code-000000?style=for-the-badge&logo=github&logoColor=white)](https://github.com/mzf666/MATPO)
  </div>

 
  examples/sglang_multiturn/qwen3-14b_musique_MATPO.sh
  ```

+ ## Sample Usage
+
+ You can try the model yourself with the following Python snippet for inference:
+
+ ```python
+ import os
+ from types import SimpleNamespace
+
+ from accelerate import Accelerator
+ from tqdm import tqdm
+
+ from matpo.models import LlavaAgent
+ from matpo.environment.webgym import BatchedWebEnv
+
+ # ============= Instantiate the agent =============
+ config = SimpleNamespace(
+     use_lora=False,
+     use_q4=False,  # set to True if you are using the 34B model (quantized to 4-bit)
+     use_anyres=False,
+     temperature=1.0,
+     max_new_tokens=512,
+     train_vision=False,
+     num_beams=1,
+ )
+
+ accelerator = Accelerator()
+ agent = LlavaAgent(
+     policy_lm="veggiebird/MATPO-14b",  # specify your model here
+     device=accelerator.device,
+     accelerator=accelerator,
+     config=config,
+ )
+
+ # ============= Instantiate the environment (example, adjust as needed) =============
+ test_tasks = [{
+     "web_name": "Google Map",
+     "id": "0",
+     "ques": "Locate a parking lot near the Brooklyn Bridge that is open 24 hours. Review the user comments about it.",
+     "web": "https://www.google.com/maps/",
+ }]
+ save_path = "path/to/save/results"  # Change this
+
+ test_env = BatchedWebEnv(
+     tasks=test_tasks,
+     do_eval=False,
+     download_dir=os.path.join(save_path, "test_driver", "download"),
+     output_dir=os.path.join(save_path, "test_driver", "output"),
+     batch_size=1,
+     max_iter=10,
+ )
+
+ image_histories = []   # stores the history of the paths of images
+ action_histories = []  # stores the history of actions
+
+ results = test_env.reset()
+ image_histories.append(results[0][0]["image"])
+
+ observations = [r[0] for r in results]
+ actions = agent.get_action(observations)
+ action_histories.append(actions[0])
+ dones = None
+
+ for _ in tqdm(range(3)):  # iterate for a few steps
+     if dones is not None and all(dones):
+         break
+     results = test_env.step(actions)
+     image_histories.append(results[0][0]["image"])
+     observations = [r[0] for r in results]
+     actions = agent.get_action(observations)
+     action_histories.append(actions[0])
+     dones = [r[2] for r in results]
+
+ print("Done!")
+ print("image_histories:", image_histories)
+ print("action_histories:", action_histories)
+ ```
+
  ## Experiments and Results

  ### Main Results

  MATPO extends GRPO with principled credit assignment:
  1. The planner's final answer determines the accuracy reward
  2. This reward is normalized across all rollouts in a group
+ 3. Gradients flow proportionally to both planner actions and worker actions
  4. Worker agents receive the same advantage value as their parent planner rollout

  See our paper for more details.
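
The credit-assignment steps above can be sketched in a few lines. This is a minimal illustrative sketch, not the repo's training code: `matpo_advantages` and the `worker_of` parent-mapping are hypothetical names, and it assumes GRPO-style group normalization of the planner's accuracy reward.

```python
import torch

def matpo_advantages(planner_rewards: torch.Tensor, worker_of: list) -> tuple:
    """Group-normalize planner rewards and share each planner rollout's
    advantage with its worker sub-trajectories (hypothetical helper)."""
    # Steps 1-2: the planner's final-answer reward is normalized
    # across the rollouts in the group (GRPO-style).
    adv = (planner_rewards - planner_rewards.mean()) / (planner_rewards.std() + 1e-8)
    # Step 4: each worker trajectory inherits the advantage of its
    # parent planner rollout, so gradients flow to both roles (step 3).
    worker_adv = adv[torch.tensor(worker_of)]
    return adv, worker_adv

# Toy group of 4 planner rollouts; workers 0 and 1 belong to
# planner rollout 0, worker 2 to planner rollout 3.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
planner_adv, worker_adv = matpo_advantages(rewards, worker_of=[0, 0, 3])
```

Because workers reuse their parent's advantage rather than receiving a separately normalized reward, a single shared LLM can be updated for both roles from one scalar outcome signal.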