---
base_model: Qwen/Qwen2.5-32B-Instruct
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
tags:
- multi-agent systems
- multiagent-collaboration
- reasoning
- mathematics
- code
model-index:
- name: m1-32b
  results: []
---

**Paper:** [Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning](https://arxiv.org/pdf/2504.09772)

**M1-32B** is a 32B-parameter large language model fine-tuned from [Qwen2.5-32B-Instruct](https://arxiv.org/pdf/2412.15115) on the **M500** dataset—an interdisciplinary multi-agent collaborative reasoning dataset. M1-32B is optimized for improved reasoning, discussion, and decision-making in multi-agent systems (MAS), including frameworks such as [AgentVerse](https://github.com/OpenBMB/AgentVerse).

**Code:** [https://github.com/jincan333/MAS-TTS](https://github.com/jincan333/MAS-TTS)  
**Project page:** [https://github.com/jincan333/MAS-TTS](https://github.com/jincan333/MAS-TTS)

---

## How to Use with 🤗 Transformers

You can use this model directly with the `transformers` library for text generation.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "Can111/m1-32b"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16, # Use bfloat16 for optimal performance if supported
    device_map="auto" # Automatically distribute model across available devices
)
model.eval() # Set model to evaluation mode

# Define your conversation messages
messages = [
    {"role": "user", "content": "Explain multi-agent collaborative reasoning and its benefits."},
]

# Apply chat template and tokenize inputs
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate response
generated_ids = model.generate(
    **model_inputs,  # Pass input_ids and attention_mask together
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)

# Decode and print the generated text
decoded_output = tokenizer.batch_decode(generated_ids[:, model_inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(decoded_output)
```

---

## 🚀 Key Features

- 🧠 **Enhanced Collaborative Reasoning**  
  Trained on real multi-agent traces involving diverse roles like Expert Recruiter, Problem Solvers, and Evaluator.

- 🗣️ **Role-Aware Dialogue Generation**  
  Learns to reason and respond from different expert perspectives based on structured prompts.

- ⚙️ **Optimized for Multi-Agent Systems**  
  Performs well as a MAS agent with adaptive collaboration and token budgeting.

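The role-conditioned setup above can be sketched as plain chat-format message construction. The role names and prompt wording below are illustrative assumptions, not the exact templates used to train M1-32B:

```python
# Sketch: building a role-conditioned prompt for one agent turn in a MAS
# discussion round. Role names (e.g. "Evaluator") follow the roles listed
# above; the prompt text itself is hypothetical.

def build_role_messages(role: str, task: str, discussion_history: list) -> list:
    """Assemble a chat-format message list for a single agent turn."""
    system_prompt = (
        f"You are the {role} in a multi-agent team. "
        "Read the discussion so far, then contribute from your role's perspective."
    )
    history = "\n".join(discussion_history) if discussion_history else "(no prior turns)"
    user_prompt = f"Task: {task}\n\nDiscussion so far:\n{history}"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

messages = build_role_messages(
    role="Evaluator",
    task="Compute 17 * 24 and verify the result.",
    discussion_history=["Problem Solver: 17 * 24 = 408."],
)
print(messages[0]["content"])
```

A message list built this way can be passed to `tokenizer.apply_chat_template(...)` exactly as in the usage example above, with one such call per agent turn.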
---

## 🏗️ Model Training

- **Base Model:** Qwen2.5-32B-Instruct  
- **Dataset:** [M500](https://huggingface.co/datasets/Can111/M500) (500 curated multi-agent reasoning traces)  
- **Objective:** Supervised Fine-Tuning (SFT) on role-conditioned prompts  
- **Training Setup:**  
  - 8 × A100 GPUs  
  - 5 epochs  
  - Learning rate: 1e-5  
  - Frameworks: DeepSpeed, FlashAttention, LLaMA-Factory

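SFT on role-conditioned prompts typically supervises only the agent's reply, masking prompt tokens out of the loss. The sketch below shows this standard label-masking convention; the exact preprocessing in LLaMA-Factory may differ:

```python
# Sketch: label masking for SFT on chat data. Prompt tokens are masked with
# -100 (the ignore index used by common trainers), so only the agent's
# response tokens contribute to the cross-entropy loss.
IGNORE_INDEX = -100

def build_labels(prompt_ids, response_ids):
    """Return per-token labels: ignore the prompt, supervise the response."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)

labels = build_labels([101, 102, 103], [7, 8, 9])
# -> [-100, -100, -100, 7, 8, 9]: loss is computed only on the last three tokens
```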
---

## 📊 Performance

GPQA and Commongen measure general understanding; AIME2024 and MATH-500 measure mathematical reasoning; HumanEval and MBPP-S measure coding.

| **Model**                | **GPQA** | **Commongen** | **AIME2024** | **MATH-500** | **HumanEval** | **MBPP-S** |
|--------------------------|----------|---------------|--------------|--------------|---------------|------------|
| **Non-Reasoning Models** |                           |                |                             |            |                |           |
| Qwen2.5                  | 50.2                      | 96.7           | 21.1                        | 84.4       | 89.0           | 80.2      |
| DeepSeek-V3              | **58.6**                  | **98.6**       | **33.3**                    | **88.6**   | 89.6           | 83.9      |
| GPT-4o                   | 49.2                      | 97.8           | 7.8                         | 81.3       | **90.9**       | **85.4**  |
| **Reasoning Models**     |                           |                |                             |            |                |           |
| s1.1-32B                 | 58.3                      | 94.1           | 53.3                        | 90.6       | 82.3           | 77.4      |
| DeepSeek-R1              | **75.5**                  | 97.2           | 78.9                        | **96.2**   | **98.2**       | 91.7      |
| o3-mini                  | 71.3                      | **99.1**       | **84.4**                    | 95.3       | 97.0           | **93.6**  |
| M1-32B (Ours)            | 61.1                      | 96.9           | 60.0                        | 95.1       | 92.8           | 89.1      |
| M1-32B w. CEO (Ours)     | 62.1                      | 97.4           | 62.2                        | 95.8       | 93.9           | 90.5      |

**Table caption:**  
Performance comparison on general understanding, mathematical reasoning, and coding tasks against strong reasoning and non-reasoning models within the AgentVerse framework. Our method achieves substantial improvements over Qwen2.5 and s1.1-32B on all tasks and attains performance comparable to o3-mini and DeepSeek-R1 on MATH-500 and MBPP-S, demonstrating its effectiveness in enhancing collaborative reasoning in MAS. Note that the s1.1-32B results are obtained without budget forcing.

---

## 💬 Intended Use

M1-32B is intended for research on multi-agent reasoning and collaboration in MAS.

---

## Citation

If you use this model, please cite the relevant papers:

```bibtex
@article{jin2025two,
  title={Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning},
  author={Jin, Can and Peng, Hongwu and Zhang, Qixin and Tang, Yujin and Metaxas, Dimitris N and Che, Tong},
  journal={arXiv preprint arXiv:2504.09772},
  year={2025}
}
```