---
base_model: Qwen/Qwen2.5-32B-Instruct
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
tags:
- multi-agent systems
- multiagent-collaboration
- reasoning
- mathematics
- code
model-index:
- name: m1-32b
  results: []
---

Paper: [Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning](https://arxiv.org/pdf/2504.09772)

**M1-32B** is a 32B-parameter large language model fine-tuned from [Qwen2.5-32B-Instruct](https://arxiv.org/pdf/2412.15115) on the **M500** dataset, an interdisciplinary multi-agent collaborative reasoning dataset. M1-32B is optimized for improved reasoning, discussion, and decision-making in multi-agent systems (MAS), including frameworks such as [AgentVerse](https://github.com/OpenBMB/AgentVerse).

Code: [https://github.com/jincan333/MAS-TTS](https://github.com/jincan333/MAS-TTS)

Project page: [https://github.com/jincan333/MAS-TTS](https://github.com/jincan333/MAS-TTS)

---
## How to Use with 🤗 Transformers
You can use this model directly with the `transformers` library for text generation.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Can111/m1-32b"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Use bfloat16 to reduce memory if the hardware supports it
    device_map="auto",  # Automatically distribute the model across available devices
)
model.eval()  # Set model to evaluation mode

# Define your conversation messages
messages = [
    {"role": "user", "content": "Explain multi-agent collaborative reasoning and its benefits."},
]

# Apply chat template and tokenize inputs
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate a response; unpacking model_inputs also forwards the attention mask
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Decode only the newly generated tokens (skip the prompt) and print them
prompt_length = model_inputs.input_ids.shape[1]
decoded_output = tokenizer.batch_decode(generated_ids[:, prompt_length:], skip_special_tokens=True)[0]
print(decoded_output)
```
---
## 🚀 Key Features
- 🧠 **Enhanced Collaborative Reasoning**
Trained on real multi-agent traces involving diverse roles like Expert Recruiter, Problem Solvers, and Evaluator.
- 🗣️ **Role-Aware Dialogue Generation**
Learns to reason and respond from different expert perspectives based on structured prompts.
- ⚙️ **Optimized for Multi-Agent Systems**
Performs well as a MAS agent with adaptive collaboration and token budgeting.
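
As a rough illustration of role-aware prompting, the sketch below builds a chat message list for one agent role. The role names come from the list above, but the prompt wording, the `ROLE_PROMPTS` table, and the `build_role_messages` helper are hypothetical; they are not the templates used in M500 or AgentVerse. The returned list has the same shape as `messages` in the usage example, so it could be passed to `tokenizer.apply_chat_template` in the same way.

```python
from typing import Dict, List, Optional

# Hypothetical role descriptions for illustration only; these are NOT the
# prompts used to build M500 or the ones shipped with AgentVerse.
ROLE_PROMPTS: Dict[str, str] = {
    "Expert Recruiter": "You decide which experts are needed to solve the task.",
    "Problem Solver": "You are one of several experts. Propose or refine a solution.",
    "Evaluator": "You review the proposed solution and state whether it is correct.",
}


def build_role_messages(
    role: str,
    task: str,
    history: Optional[List[str]] = None,
) -> List[Dict[str, str]]:
    """Return chat messages that condition the model on a single agent role."""
    if role not in ROLE_PROMPTS:
        raise ValueError(f"Unknown role: {role!r}")

    # The system message carries the role; the user message carries the task
    # plus any earlier turns from other agents.
    user_content = f"Task: {task}"
    if history:
        discussion = "\n".join(history)
        user_content += f"\n\nDiscussion so far:\n{discussion}"

    return [
        {"role": "system", "content": ROLE_PROMPTS[role]},
        {"role": "user", "content": user_content},
    ]


messages = build_role_messages(
    "Evaluator",
    "What is the sum of the first 10 positive integers?",
    history=["Problem Solver: The sum is 10 * 11 / 2 = 55."],
)
for message in messages:
    print(f"[{message['role']}] {message['content']}")
```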
---
## 🏗️ Model Training
- **Base Model:** Qwen2.5-32B-Instruct
- **Dataset:** [M500](https://huggingface.co/datasets/Can111/M500) (500 curated multi-agent reasoning traces)
- **Objective:** Supervised Fine-Tuning (SFT) on role-conditioned prompts
- **Training Setup:**
  - 8 × A100 GPUs
  - 5 epochs
  - Learning rate: 1e-5
  - Frameworks: DeepSpeed, FlashAttention, LLaMA-Factory
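
To give a sense of how small this fine-tuning run is, the sketch below estimates the number of optimizer steps from the figures listed above (500 traces, 8 GPUs, 5 epochs). The per-device batch size and gradient accumulation values are assumptions, since this card does not state them, so the result is only a rough estimate and not the actual training schedule.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SFTSetup:
    # Values stated in the training setup above
    num_examples: int = 500        # M500 traces
    num_gpus: int = 8              # 8 x A100
    num_epochs: int = 5
    learning_rate: float = 1e-5
    # Assumptions: NOT stated in this model card
    per_device_batch_size: int = 1
    gradient_accumulation_steps: int = 1

    @property
    def global_batch_size(self) -> int:
        """Examples consumed per optimizer step across all GPUs."""
        return self.num_gpus * self.per_device_batch_size * self.gradient_accumulation_steps

    @property
    def steps_per_epoch(self) -> int:
        """Optimizer steps per epoch, counting a final partial batch."""
        return math.ceil(self.num_examples / self.global_batch_size)

    @property
    def total_steps(self) -> int:
        return self.steps_per_epoch * self.num_epochs


setup = SFTSetup()
print("global batch size:", setup.global_batch_size)
print("steps per epoch:", setup.steps_per_epoch)
print("total optimizer steps:", setup.total_steps)
```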
---
## 📊 Performance
| **Model** | **General Understanding** | | **Mathematical Reasoning** | | **Coding** | |
|--------------------------|---------------------------|----------------|-----------------------------|------------|----------------|-----------|
| | **GPQA** | **Commongen** | **AIME2024** | **MATH-500** | **HumanEval** | **MBPP-S**|
| **Non-Reasoning Models** | | | | | | |
| Qwen2.5 | 50.2 | 96.7 | 21.1 | 84.4 | 89.0 | 80.2 |
| DeepSeek-V3 | **58.6** | **98.6** | **33.3** | **88.6** | 89.6 | 83.9 |
| GPT-4o | 49.2 | 97.8 | 7.8 | 81.3 | **90.9** | **85.4** |
| **Reasoning Models** | | | | | | |
| s1.1-32B | 58.3 | 94.1 | 53.3 | 90.6 | 82.3 | 77.4 |
| DeepSeek-R1 | **75.5** | 97.2 | 78.9 | **96.2** | **98.2** | 91.7 |
| o3-mini | 71.3 | **99.1** | **84.4** | 95.3 | 97.0 | **93.6** |
| M1-32B (Ours) | 61.1 | 96.9 | 60.0 | 95.1 | 92.8 | 89.1 |
| M1-32B w. CEO (Ours) | 62.1 | 97.4 | 62.2 | 95.8 | 93.9 | 90.5 |

**Table caption:** Performance comparison on general understanding, mathematical reasoning, and coding tasks using strong reasoning and non-reasoning models within the AgentVerse framework. M1-32B improves over Qwen2.5 and s1.1-32B on all six benchmarks and attains performance comparable to o3-mini and DeepSeek-R1 on MATH-500 and MBPP-S, demonstrating its effectiveness in enhancing collaborative reasoning in MAS. The s1.1-32B results are obtained without budget forcing.

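
The comparison with Qwen2.5 and s1.1-32B can be checked directly from the table. The short script below copies the three relevant rows and computes the per-benchmark gain of M1-32B (without CEO) over each baseline; the `gains` helper is only an illustrative name. The gains range from +0.2 (Commongen, over Qwen2.5) to +38.9 (AIME2024, over Qwen2.5).

```python
# Scores copied from the table above (AgentVerse setting).
BENCHMARKS = ["GPQA", "Commongen", "AIME2024", "MATH-500", "HumanEval", "MBPP-S"]
SCORES = {
    "Qwen2.5": [50.2, 96.7, 21.1, 84.4, 89.0, 80.2],
    "s1.1-32B": [58.3, 94.1, 53.3, 90.6, 82.3, 77.4],
    "M1-32B": [61.1, 96.9, 60.0, 95.1, 92.8, 89.1],
}


def gains(model: str, baseline: str) -> dict:
    """Per-benchmark score difference (model minus baseline), in points."""
    return {
        name: round(m - b, 1)
        for name, m, b in zip(BENCHMARKS, SCORES[model], SCORES[baseline])
    }


for baseline in ("Qwen2.5", "s1.1-32B"):
    diff = gains("M1-32B", baseline)
    print(f"M1-32B vs {baseline}:")
    for name, delta in diff.items():
        print(f"  {name:<10} {delta:+.1f}")
```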
---
## 💬 Intended Use
M1-32B is intended for research on multi-agent reasoning and collaboration in MAS.

---
## Citation
If you use this model, please cite the following paper:
```bibtex
@article{jin2025two,
title={Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning},
author={Jin, Can and Peng, Hongwu and Zhang, Qixin and Tang, Yujin and Metaxas, Dimitris N and Che, Tong},
journal={arXiv preprint arXiv:2504.09772},
year={2025}
}
```