	---
	license: apache-2.0
	---

	<h1 align="center">Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning</h1>

	<h5 align="center">

	[![arXiv](https://img.shields.io/badge/Arxiv-2510.20519-b31b1b.svg?logo=arXiv)](https://arxiv.org/pdf/2510.20519)&ensp;<a href='https://huggingface.co/mmthinking/Metis-HOME'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face%20-models-blue'></a>&ensp;[![Code License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/LICENSE)

	</h5>


	## 💡 Overview
	Current multimodal reasoning models face a critical dilemma: they often "overthink" on simple tasks (inefficiency) and suffer from general capability degradation when optimized for reasoning.

	We introduce Metis-HOME (Hybrid Optimized Mixture-of-Experts), a novel framework that enables a "Hybrid Thinking" paradigm. By structuring the original dense model (Qwen2.5-VL-7B) into two distinct expert branches—a Thinking Expert for complex reasoning and a Non-Thinking Expert for rapid inference—controlled by a lightweight router, Metis-HOME effectively resolves the reasoning-vs-generalization trade-off.

	<div style="display: flex; justify-content: center; gap: 20px; flex-wrap: wrap;">
	<img src="https://raw.githubusercontent.com/MM-Thinking/Metis-HOME/main/assets/framework.png" alt="Metis-HOME framework overview" style="width:400px; max-width:100%;">
	<img src="https://raw.githubusercontent.com/MM-Thinking/Metis-HOME/main/assets/radar_chart.png" alt="Metis-HOME benchmark radar chart" style="width:400px; max-width:100%;">
	</div>
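
The routing idea described above can be illustrated with a toy sketch. This is not the Metis-HOME implementation: the real router is a trained component inside the model, whereas `keyword_score`, the `threshold` value, and both expert functions below are hypothetical stand-ins chosen only to show the control flow of dispatching a query to one of two branches.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


def keyword_score(query: str) -> float:
    """Hypothetical complexity scorer in [0, 1] (illustrative heuristic only)."""
    hints = ("prove", "solve", "how many", "calculate", "why")
    lowered = query.lower()
    hits = sum(hint in lowered for hint in hints)
    return min(1.0, hits / 2)


@dataclass
class ToyRouter:
    """Two-way dispatcher: sends a query to a thinking or non-thinking branch."""
    score_fn: Callable[[str], float] = keyword_score
    threshold: float = 0.5  # assumed decision boundary, not taken from the paper

    def route(self, query: str) -> str:
        return "thinking" if self.score_fn(query) >= self.threshold else "non_thinking"


def thinking_expert(query: str) -> str:
    # Placeholder for the slow, deliberative branch.
    return f"[thinking] multi-step answer to: {query}"


def non_thinking_expert(query: str) -> str:
    # Placeholder for the fast, direct branch.
    return f"[non_thinking] direct answer to: {query}"


def answer(router: ToyRouter, query: str) -> Tuple[str, str]:
    """Route the query, then call the selected expert."""
    experts = {"thinking": thinking_expert, "non_thinking": non_thinking_expert}
    branch = router.route(query)
    return branch, experts[branch](query)


router = ToyRouter()
math_branch, _ = answer(router, "Solve for x and calculate the area")
ocr_branch, _ = answer(router, "Read the text in this image")
```

In this sketch the math-style query goes to the thinking branch and the OCR-style query goes to the non-thinking branch, which mirrors the intended behavior at a very coarse level.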

	## ✨ Highlights

	- 🧠 Hybrid Thinking Paradigm: Explicitly decouples "System 1" (fast, intuitive) and "System 2" (slow, deliberative) reasoning within a unified multimodal MoE architecture.
	- 🔄 Router Mechanism: A lightweight, trainable router dynamically allocates queries based on complexity, avoiding computational waste on simple tasks like OCR or Captioning.
	- 🚀 Performance:
	  - +6.9 points on the reasoning benchmark average (MathVista, etc.) compared to the baseline (39.2 → 46.1).
	  - +0.9 points on the general benchmark average (70.3 → 71.2), reversing the degradation trend observed in other reasoning-specialized models.

	- 🛠️ Efficient Training: A multi-stage strategy combining Reinforcement Learning (RL) for reasoning enhancement and Mixed Supervised Fine-Tuning (SFT) for expert specialization.
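
The headline gains can be reproduced from the average columns of the Benchmarks table in this README. The snippet below is a small arithmetic check; the dictionary layout and the `gain` helper are illustrative names, while the numbers are copied from the table.

```python
# Average scores copied from the Benchmarks table in this README.
AVG = {
    "Baseline":    {"reasoning": 39.2, "general": 70.3},
    "Baseline+RL": {"reasoning": 44.0, "general": 67.2},
    "Metis-HOME":  {"reasoning": 46.1, "general": 71.2},
}


def gain(model: str, reference: str = "Baseline") -> dict:
    """Difference in average score (in points) between a model and a reference."""
    return {
        split: round(AVG[model][split] - AVG[reference][split], 1)
        for split in AVG[model]
    }


home_gain = gain("Metis-HOME")   # gains of the full model over the baseline
rl_gain = gain("Baseline+RL")    # gains of the RL-only variant over the baseline
```

Here `home_gain` comes out as `{"reasoning": 6.9, "general": 0.9}`, while `rl_gain` is `{"reasoning": 4.8, "general": -3.1}`: RL alone improves reasoning but lowers the general average, and Metis-HOME recovers it.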


	## 📊 Results

	### Thinking Ratio
	As shown in the following figure, the thinking ratio analysis of Metis-HOME reveals adaptive routing behavior:
	- High ratios (78%–98%) on reasoning-heavy benchmarks (WeMath, MathVision, etc.), indicating effective use of the thinking expert for multi-step inference.
	- Low ratios (2%–5%) on general benchmarks (MMBench, OCRBench), showing preference for the non-thinking expert.

	This aligns with our design: deliberate reasoning for complex tasks, fast inference for simple ones, optimizing computational efficiency.

	<img src="https://raw.githubusercontent.com/MM-Thinking/Metis-HOME/main/assets/thinking_ratio_chart.png" alt="Metis-HOME thinking ratio per benchmark" style="width:850px; max-width:100%;">
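
For readers who want to compute a thinking ratio on their own evaluation runs, a minimal sketch follows. It assumes the ratio is the share of queries routed to the thinking expert and that each routing decision is logged as the string `"thinking"` or `"non_thinking"`; both the log format and the sample data are hypothetical and are not taken from the Metis-HOME code or results.

```python
from collections import Counter
from typing import Sequence


def thinking_ratio(decisions: Sequence[str]) -> float:
    """Percentage of queries that were routed to the thinking expert."""
    if not decisions:
        raise ValueError("need at least one routing decision")
    counts = Counter(decisions)
    return 100.0 * counts["thinking"] / len(decisions)


# Made-up routing logs for two benchmark types (illustration only).
reasoning_log = ["thinking"] * 9 + ["non_thinking"] * 1
general_log = ["thinking"] * 1 + ["non_thinking"] * 24

reasoning_ratio = thinking_ratio(reasoning_log)  # 90.0
general_ratio = thinking_ratio(general_log)      # 4.0
```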


	### Benchmarks
	<table>
	<thead>
	<tr>
	<th rowspan="2" style="text-align:left; vertical-align:bottom;">Model</th>
	<th colspan="7" style="text-align:center; border-bottom:1px solid #ccc;">Reasoning</th>
	<th style="text-align:center; border-bottom:1px solid #ccc;">General</th>
	</tr>
	<tr>
	<th>MathVista</th>
	<th>MathVision</th>
	<th>MathVerse</th>
	<th>DynaMath</th>
	<th>WeMath</th>
	<th>LogicVista</th>
	<th>Avg.</th>
	<th>Avg.</th>
	</tr>
	</thead>
	<tbody>

	<tr style="background-color: #e0e0e0;">
	<td colspan="9" align="center"><strong><em>Proprietary Models</em></strong></td>
	</tr>

	<tr>
	<td>Gemini-2.0-Pro</td>
	<td>71.3</td>
	<td>48.1</td>
	<td>67.3</td>
	<td>43.3</td>
	<td>56.5</td>
	<td>53.2</td>
	<td>56.6</td>
	<td>73.3</td>
	</tr>
	<tr>
	<td>Gemini-2.0-Flash</td>
	<td>70.4</td>
	<td>43.6</td>
	<td>47.8</td>
	<td>42.1</td>
	<td>47.4</td>
	<td>52.3</td>
	<td>50.6</td>
	<td>72.6</td>
	</tr>
	<tr>
	<td>Claude 3.7 Sonnet</td>
	<td>66.8</td>
	<td>41.9</td>
	<td>46.7</td>
	<td>39.7</td>
	<td>49.3</td>
	<td>58.2</td>
	<td>50.4</td>
	<td>70.1</td>
	</tr>
	<tr>
	<td>ChatGPT-4o</td>
	<td>60.0</td>
	<td>31.2</td>
	<td>40.6</td>
	<td>34.5</td>
	<td>45.8</td>
	<td>52.8</td>
	<td>44.2</td>
	<td>72.0</td>
	</tr>


	<tr style="background-color: #e0e0e0;">
	<td colspan="9" align="center"><strong><em>Open-source Models</em></strong></td>
	</tr>

	<tr>
	<td>LLaVA-OneVision-72B</td>
	<td>67.1</td>
	<td>25.3</td>
	<td>27.2</td>
	<td>15.6</td>
	<td>32.0</td>
	<td>40.9</td>
	<td>34.7</td>
	<td>68.0</td>
	</tr>
	<tr>
	<td>Kimi-VL-A3B-Instruct</td>
	<td>66.0</td>
	<td>21.8</td>
	<td>34.1</td>
	<td>18.0</td>
	<td>32.3</td>
	<td>42.7</td>
	<td>35.8</td>
	<td>69.1</td>
	</tr>
	<tr>
	<td>InternVL3-8B</td>
	<td>70.5</td>
	<td>30.0</td>
	<td>38.5</td>
	<td>25.7</td>
	<td>39.5</td>
	<td>44.5</td>
	<td>41.4</td>
	<td>73.6</td>
	</tr>
	<tr>
	<td>VL-Rethinker-7B</td>
	<td>75.5</td>
	<td>29.3</td>
	<td>47.2</td>
	<td>25.4</td>
	<td>37.8</td>
	<td>47.0</td>
	<td>43.7</td>
	<td>68.3</td>
	</tr>
	<tr>
	<td>Metis-RISE-7B</td>
	<td>75.8</td>
	<td>28.7</td>
	<td>51.0</td>
	<td>27.7</td>
	<td>45.2</td>
	<td>49.7</td>
	<td>46.4</td>
	<td>68.4</td>
	</tr>

	<tr>
	<td style="border-top: 1px solid #000;">Baseline</td>
	<td style="border-top: 1px solid #000;">67.4</td>
	<td style="border-top: 1px solid #000;">26.2</td>
	<td style="border-top: 1px solid #000;">41.1</td>
	<td style="border-top: 1px solid #000;">20.2</td>
	<td style="border-top: 1px solid #000;">34.5</td>
	<td style="border-top: 1px solid #000;">45.6</td>
	<td style="border-top: 1px solid #000; background-color: #fff2cc;">39.2</td>
	<td style="border-top: 1px solid #000;">70.3</td>
	</tr>
	<tr>
	<td>Baseline+RL</td>
	<td>72.8</td>
	<td>28.7</td>
	<td>46.8</td>
	<td>26.2</td>
	<td>43.3</td>
	<td>46.5</td>
	<td>44.0</td>
	<td style="background-color: #e1d5e7;">67.2</td>
	</tr>
	<tr>
	<td><b>Metis-HOME</b></td>
	<td>76.0</td>
	<td>29.5</td>
	<td>47.7</td>
	<td>26.4</td>
	<td>45.6</td>
	<td>51.5</td>
	<td style="background-color: #fff2cc;"><b>46.1</b></td>
	<td style="background-color: #e1d5e7;"><b>71.2</b></td>
	</tr>
	</tbody>
	</table>
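
The table does not say how the reasoning "Avg." column is computed. The sketch below checks one plausible reading, namely an unweighted mean over the six reasoning benchmarks, against the reported values. The per-benchmark scores are copied from the table; the helper name and the rounding tolerance are assumptions.

```python
# Per-benchmark reasoning scores, in table order:
# MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista.
REASONING = {
    "Baseline":    [67.4, 26.2, 41.1, 20.2, 34.5, 45.6],
    "Baseline+RL": [72.8, 28.7, 46.8, 26.2, 43.3, 46.5],
    "Metis-HOME":  [76.0, 29.5, 47.7, 26.4, 45.6, 51.5],
}

# Reasoning "Avg." values as reported in the table.
REPORTED_AVG = {"Baseline": 39.2, "Baseline+RL": 44.0, "Metis-HOME": 46.1}


def matches_reported(scores, reported, tol=0.051):
    """True if the unweighted mean agrees with the reported value up to rounding."""
    mean = sum(scores) / len(scores)
    return abs(mean - reported) <= tol


consistent = {
    model: matches_reported(scores, REPORTED_AVG[model])
    for model, scores in REASONING.items()
}
```

All three rows are consistent with an unweighted mean under this tolerance, which supports that reading of the column without confirming it.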


	## 🔍 Usage Example

	You can use the demo inference script in the `examples` folder:

	```bash
	python examples/demo_inference.py
	```

	## 📌 Acknowledgement
	We sincerely thank [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) and [MM-EUREKA](https://github.com/ModalMinds/MM-EUREKA) for providing the training frameworks we used as references.

	## 📖 Citation

	```bibtex
	@article{lan2025metis,
	title={Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning},
	author={Lan, Xiaohan and Liu, Fanfan and Qiu, Haibo and Yang, Siqi and Ruan, Delian and Shi, Peng and Ma, Lin},
	journal={arXiv preprint arXiv:2510.20519},
	year={2025}
	}
	```