	---
	base_model: Qwen/Qwen2.5-32B-Instruct
	language:
	- zho
	- eng
	- fra
	- spa
	- por
	- deu
	- ita
	- rus
	- jpn
	- kor
	- vie
	- tha
	- ara
	library_name: transformers
	license: apache-2.0
	pipeline_tag: text-generation
	tags:
	- multi-agent systems
	- multiagent-collaboration
	- reasoning
	- mathematics
	- code
	model-index:
	- name: m1-32b
	  results: []
	---

	[Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning](https://arxiv.org/pdf/2504.09772)

	M1-32B is a 32B-parameter large language model fine-tuned from [Qwen2.5-32B-Instruct](https://arxiv.org/pdf/2412.15115) on the M500 dataset—an interdisciplinary multi-agent collaborative reasoning dataset. M1-32B is optimized for improved reasoning, discussion, and decision-making in multi-agent systems (MAS), including frameworks such as [AgentVerse](https://github.com/OpenBMB/AgentVerse).

	Code: [https://github.com/jincan333/MAS-TTS](https://github.com/jincan333/MAS-TTS)
	Project page: [https://github.com/jincan333/MAS-TTS](https://github.com/jincan333/MAS-TTS)

	---

	## How to Use with 🤗 Transformers

	You can use this model directly with the `transformers` library for text generation.

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	model_id = "Can111/m1-32b"

	# Load tokenizer and model
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    torch_dtype=torch.bfloat16,  # Use bfloat16 if the hardware supports it
	    device_map="auto",  # Distribute the model across available devices (requires `accelerate`)
	)
	model.eval()  # Set model to evaluation mode

	# Define your conversation messages
	messages = [
	    {"role": "user", "content": "Explain multi-agent collaborative reasoning and its benefits."},
	]

	# Apply the chat template to get the prompt string
	text = tokenizer.apply_chat_template(
	    messages,
	    tokenize=False,
	    add_generation_prompt=True,
	)

	# Tokenize the prompt and move the tensors to the model's device
	model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

	# Generate a response; unpacking the encoding passes the attention mask along with the input IDs
	generated_ids = model.generate(
	    **model_inputs,
	    max_new_tokens=256,
	    do_sample=True,
	    temperature=0.7,
	    top_p=0.9,
	)

	# Drop the prompt tokens, then decode and print only the newly generated text
	prompt_length = model_inputs.input_ids.shape[1]
	decoded_output = tokenizer.batch_decode(generated_ids[:, prompt_length:], skip_special_tokens=True)[0]
	print(decoded_output)
	```

	---

	## 🚀 Key Features

	- 🧠 Enhanced Collaborative Reasoning
	  Trained on real multi-agent traces involving diverse roles such as Expert Recruiter, Problem Solvers, and Evaluator.

	- 🗣️ Role-Aware Dialogue Generation
	  Learns to reason and respond from different expert perspectives based on structured prompts.

	- ⚙️ Optimized for Multi-Agent Systems
	  Performs well as an MAS agent with adaptive collaboration and token budgeting.
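
One way to exercise the role-aware behaviour described above is to put a role description in the system message. The sketch below is illustrative only: `build_role_messages`, the prompt wording, and the transcript layout are hypothetical and are not the templates used in M500 or AgentVerse. The MAS-TTS repository is the place to look for the actual prompts.

```python
from typing import Dict, List, Optional


def build_role_messages(
    role: str,
    task: str,
    discussion: Optional[List[str]] = None,
) -> List[Dict[str, str]]:
    """Build chat messages that condition the model on an agent role.

    Hypothetical helper: the prompt wording is an assumption for illustration.
    """
    system = f"You are the {role} in a multi-agent team. Respond only from that perspective."
    user_parts = [f"Task: {task}"]
    if discussion:
        # Earlier agent turns are appended as a plain-text transcript.
        user_parts.append("Discussion so far:")
        user_parts.extend(f"- {turn}" for turn in discussion)
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": "\n".join(user_parts)},
    ]


messages = build_role_messages(
    role="Evaluator",
    task="Check whether the proposed proof is complete.",
    discussion=["Problem Solver: The claim follows by induction on n."],
)
print(messages[1]["content"])
```

The returned list has the same shape as `messages` in the usage example, so it can be passed to `tokenizer.apply_chat_template` unchanged.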

	---

	## 🏗️ Model Training

	- Base Model: Qwen2.5-32B-Instruct
	- Dataset: [M500](https://huggingface.co/datasets/Can111/M500) (500 curated multi-agent reasoning traces)
	- Objective: Supervised Fine-Tuning (SFT) on role-conditioned prompts
	- Training Setup:
	  - 8 × A100 GPUs
	  - 5 epochs
	  - Learning rate: 1e-5
	  - Frameworks: DeepSpeed, FlashAttention, LLaMA-Factory
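
To give a sense of how short this fine-tuning run is, the sketch below estimates the number of optimizer updates implied by the setup above. The dataset size, GPU count, and epoch count come from this card. The per-device batch size and gradient accumulation steps are not stated here, so the values used are assumptions, and the function name is illustrative.

```python
import math

NUM_EXAMPLES = 500     # size of M500, from the list above
NUM_GPUS = 8           # from the list above
NUM_EPOCHS = 5         # from the list above
PER_DEVICE_BATCH = 1   # assumption: not stated in this card
GRAD_ACCUM_STEPS = 1   # assumption: not stated in this card


def total_optimizer_steps(num_examples, num_gpus, per_device_batch, grad_accum, epochs):
    """Return the number of optimizer updates for a data-parallel SFT run.

    Assumes the last partial batch of each epoch is kept rather than dropped.
    """
    global_batch = num_gpus * per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_examples / global_batch)
    return steps_per_epoch * epochs


steps = total_optimizer_steps(
    NUM_EXAMPLES, NUM_GPUS, PER_DEVICE_BATCH, GRAD_ACCUM_STEPS, NUM_EPOCHS
)
print(steps)  # 315 under the assumed batch settings
```

With a larger assumed global batch the count shrinks proportionally, so the actual number of updates depends on settings this card does not report.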

	---

	## 📊 Performance

	| Model | General Understanding | | Mathematical Reasoning | | Coding | |
	|--------------------------|---------------------------|----------------|-----------------------------|------------|----------------|-----------|
	| | GPQA | Commongen | AIME2024 | MATH-500 | HumanEval | MBPP-S |
	| Non-Reasoning Models | | | | | | |
	| Qwen2.5 | 50.2 | 96.7 | 21.1 | 84.4 | 89.0 | 80.2 |
	| DeepSeek-V3 | 58.6 | 98.6 | 33.3 | 88.6 | 89.6 | 83.9 |
	| GPT-4o | 49.2 | 97.8 | 7.8 | 81.3 | 90.9 | 85.4 |
	| Reasoning Models | | | | | | |
	| s1.1-32B | 58.3 | 94.1 | 53.3 | 90.6 | 82.3 | 77.4 |
	| DeepSeek-R1 | 75.5 | 97.2 | 78.9 | 96.2 | 98.2 | 91.7 |
	| o3-mini | 71.3 | 99.1 | 84.4 | 95.3 | 97.0 | 93.6 |
	| M1-32B (Ours) | 61.1 | 96.9 | 60.0 | 95.1 | 92.8 | 89.1 |
	| M1-32B w. CEO (Ours) | 62.1 | 97.4 | 62.2 | 95.8 | 93.9 | 90.5 |
	Table Caption:
	Performance comparison on general understanding, mathematical reasoning, and coding tasks using strong reasoning and non-reasoning models within the AgentVerse framework. Our method achieves substantial improvements over Qwen2.5 and s1.1-32B on all tasks, and attains performance comparable to o3-mini and DeepSeek-R1 on MATH-500 and MBPP-S, demonstrating its effectiveness in enhancing collaborative reasoning in MAS. Note that the results of s1.1-32B are obtained without using budget forcing.
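
The size of the improvements mentioned in the caption can be read off the table directly. The short sketch below copies the relevant rows and computes per-benchmark point differences. The helper name `gains` is illustrative, and no numbers beyond those in the table are used.

```python
# Scores copied from the table above, listed in this benchmark order.
BENCHMARKS = ["GPQA", "Commongen", "AIME2024", "MATH-500", "HumanEval", "MBPP-S"]
SCORES = {
    "Qwen2.5": [50.2, 96.7, 21.1, 84.4, 89.0, 80.2],
    "s1.1-32B": [58.3, 94.1, 53.3, 90.6, 82.3, 77.4],
    "M1-32B": [61.1, 96.9, 60.0, 95.1, 92.8, 89.1],
}


def gains(model, baseline):
    """Return per-benchmark point differences (model minus baseline)."""
    return {
        name: round(m - b, 1)
        for name, m, b in zip(BENCHMARKS, SCORES[model], SCORES[baseline])
    }


for baseline in ("Qwen2.5", "s1.1-32B"):
    print(baseline, gains("M1-32B", baseline))
```

Against Qwen2.5, the gain ranges from 0.2 points on Commongen to 38.9 points on AIME2024, so the improvement is largest on the hardest mathematical benchmark and marginal where the baseline is already near the ceiling.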

	---

	## 💬 Intended Use

	M1-32B is intended for research on multi-agent reasoning and collaboration in multi-agent systems (MAS).

	---

	## Citation

	If you use this model, please cite the following paper:

	```bibtex
	@article{jin2025two,
	  title={Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning},
	  author={Jin, Can and Peng, Hongwu and Zhang, Qixin and Tang, Yujin and Metaxas, Dimitris N and Che, Tong},
	  journal={arXiv preprint arXiv:2504.09772},
	  year={2025}
	}
	```