---
base_model: Qwen/Qwen2.5-32B-Instruct
language:
- en
library_name: transformers
license: apache-2.0
tags:
- multi-agent-systems
- multiagent-collaboration
- reasoning
- mathematics
- code
model-index:
- name: m1-32b
  results: []
pipeline_tag: text-generation
---

[Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning](https://arxiv.org/pdf/2504.09772)

**M1-32B** is a 32B-parameter large language model fine-tuned from [Qwen2.5-32B-Instruct](https://arxiv.org/pdf/2412.15115) on **M500**, an interdisciplinary dataset of multi-agent collaborative reasoning traces. M1-32B is optimized for reasoning, discussion, and decision-making in multi-agent systems (MAS), including frameworks such as [AgentVerse](https://github.com/OpenBMB/AgentVerse).

Code: [https://github.com/jincan333/MAS-TTS](https://github.com/jincan333/MAS-TTS)
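
---

## 🔧 Quickstart

A minimal inference sketch with Transformers. The Hugging Face repo id `Can111/m1-32b` is an assumption here (mirroring the `Can111/M500` dataset namespace), and the role prompt is illustrative; adjust both to your setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id; replace with the actual model id if it differs.
model_id = "Can111/m1-32b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # keep the checkpoint's native precision
    device_map="auto",   # requires `accelerate`; shards across available GPUs
)

# Role-conditioned prompt in the spirit of the M500 traces (wording is illustrative).
messages = [
    {"role": "system", "content": "You are a Problem Solver in a multi-agent discussion. Reason step by step."},
    {"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```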

---

## 🚀 Key Features

- 🧠 **Enhanced Collaborative Reasoning**  
  Trained on real multi-agent collaboration traces covering diverse roles such as Expert Recruiter, Problem Solver, and Evaluator.

- 🗣️ **Role-Aware Dialogue Generation**  
  Learns to reason and respond from different expert perspectives based on structured, role-conditioned prompts (see the sketch below this list).

- ⚙️ **Optimized for Multi-Agent Systems**  
  Performs well as a MAS agent with adaptive collaboration and token budgeting.
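
As a sketch of what role-aware prompting can look like, the snippet below builds chat-format prompts for the three roles named above. The role descriptions and helper are hypothetical; the authoritative prompt templates live in the MAS-TTS repository.

```python
# Hypothetical role descriptions; the exact M500/MAS-TTS prompt wording may differ.
ROLES = {
    "Expert Recruiter": "You analyze the task and recruit the experts best suited to solve it.",
    "Problem Solver": "You are a recruited domain expert. Propose and defend a step-by-step solution.",
    "Evaluator": "You check the proposed solutions for errors and decide which one to accept.",
}

def role_messages(role: str, task: str, history: str = "") -> list[dict]:
    """Build a chat-format prompt that conditions the model on a single MAS role."""
    system = f"You are the {role} in a multi-agent discussion. {ROLES[role]}"
    user = f"Task: {task}" if not history else f"Task: {task}\n\nDiscussion so far:\n{history}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

# Example: prompt the Evaluator with the discussion so far.
msgs = role_messages(
    "Evaluator",
    "Prove that the sum of two even integers is even.",
    history="Problem Solver: Let a = 2m and b = 2n; then a + b = 2(m + n).",
)
```

Each agent in the loop would pass its `role_messages(...)` output through the same tokenizer/`generate` path shown in the Quickstart above.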

---

## 🏗️ Model Training

- **Base Model:** Qwen2.5-32B-Instruct  
- **Dataset:** [M500](https://huggingface.co/datasets/Can111/M500) (500 curated multi-agent reasoning traces)  
- **Objective:** Supervised Fine-Tuning (SFT) on role-conditioned prompts  
- **Training Setup:**  
  - 8 × A100 GPUs  
  - 5 epochs  
  - Learning rate: 1e-5  
  - Frameworks: DeepSpeed, FlashAttention, LLaMA-Factory
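
For intuition, a role-conditioned SFT record might be flattened into the instruction-tuning format that LLaMA-Factory commonly consumes, roughly as below. The field names and wording are assumptions; the authoritative schema is the M500 dataset itself.

```python
# Hypothetical shape of one role-conditioned training example
# (alpaca-style fields as commonly used with LLaMA-Factory; not the verified M500 schema).
example = {
    "instruction": (
        "You are the Evaluator in a multi-agent discussion. "
        "Check the proposed solutions for errors and decide which one to accept."
    ),
    "input": (
        "Task: Count the integer solutions of x + y = 10 with x, y >= 0.\n"
        "Problem Solver A: There are 11 solutions, x = 0..10.\n"
        "Problem Solver B: There are 10 solutions."
    ),
    "output": (
        "Solver A enumerates x from 0 to 10; each choice of x fixes y = 10 - x, "
        "giving 11 pairs, so 11 is correct. Accept Solver A's answer."
    ),
}
```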

---

## 📊 Performance

| **Model**                | **GPQA** | **Commongen** | **AIME2024** | **MATH-500** | **HumanEval** | **MBPP-S** |
|--------------------------|----------|---------------|--------------|--------------|---------------|------------|
| **Non-Reasoning Models** |          |               |              |              |               |            |
| Qwen2.5                  | 50.2     | 96.7          | 21.1         | 84.4         | 89.0          | 80.2       |
| DeepSeek-V3              | 58.6     | 98.6          | 33.3         | 88.6         | 89.6          | 83.9       |
| GPT-4o                   | 49.2     | 97.8          | 7.8          | 81.3         | 90.9          | 85.4       |
| **Reasoning Models**     |          |               |              |              |               |            |
| s1.1-32B                 | 58.3     | 94.1          | 53.3         | 90.6         | 82.3          | 77.4       |
| DeepSeek-R1              | **75.5** | 97.2          | 78.9         | **96.2**     | **98.2**      | 91.7       |
| o3-mini                  | 71.3     | **99.1**      | **84.4**     | 95.3         | 97.0          | **93.6**   |
| M1-32B (Ours)            | 61.1     | 96.9          | 60.0         | 95.1         | 92.8          | 89.1       |
| M1-32B w. CEO (Ours)     | 62.1     | 97.4          | 62.2         | 95.8         | 93.9          | 90.5       |

**Table Caption:**  
Performance comparison on general understanding (GPQA, Commongen), mathematical reasoning (AIME2024, MATH-500), and coding (HumanEval, MBPP-S) against strong reasoning and non-reasoning models within the AgentVerse framework; bold marks the best score in each column. Our method achieves substantial improvements over Qwen2.5 and s1.1-32B on all tasks, and attains performance comparable to o3-mini and DeepSeek-R1 on MATH-500 and MBPP-S, demonstrating its effectiveness in enhancing collaborative reasoning in MAS. Note that the s1.1-32B results are obtained without budget forcing.

---

## 💬 Intended Use

M1-32B is intended for research on multi-agent reasoning and collaboration in multi-agent systems (MAS).

---

## Citation

If you use this model, please cite the accompanying paper:

```bibtex
@article{jin2025two,
  title={Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning},
  author={Jin, Can and Peng, Hongwu and Zhang, Qixin and Tang, Yujin and Metaxas, Dimitris N and Che, Tong},
  journal={arXiv preprint arXiv:2504.09772},
  year={2025}
}
```