---
license: apache-2.0
---

<h1 align="center">Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning</h1>

<h5 align="center">

[arXiv](https://arxiv.org/pdf/2510.20519) <a href='https://huggingface.co/mmthinking/Metis-HOME'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face%20-models-blue'></a> [License](https://github.com/tatsu-lab/stanford_alpaca/blob/main/LICENSE)

</h5>

## 💡 Overview

Current multimodal reasoning models face a dilemma: they often "overthink" simple tasks, which wastes computation, and their general capabilities tend to degrade when they are optimized for reasoning.

We introduce **Metis-HOME** (**H**ybrid **O**ptimized **M**ixture-of-**E**xperts), a framework that enables a "Hybrid Thinking" paradigm. It restructures the original dense model (Qwen2.5-VL-7B) into two expert branches, a Thinking Expert for complex reasoning and a Non-Thinking Expert for rapid inference, with a lightweight router that chooses between them. This design addresses the trade-off between reasoning ability and general capability.

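To make the routing idea concrete, here is a minimal Python sketch. It is not the Metis-HOME implementation: the keyword-based `toy_router`, the two placeholder expert functions, and the `0.5` threshold are all hypothetical stand-ins for the trained router and the two expert branches.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RoutedAnswer:
    expert: str  # which branch handled the query
    text: str    # placeholder output from that branch


def toy_router(query: str) -> float:
    """Hypothetical stand-in for the trained router; returns a score in [0, 1].

    The real router is learned. This one only counts reasoning cue words.
    """
    cues = ("prove", "solve", "calculate", "how many", "why")
    hits = sum(cue in query.lower() for cue in cues)
    return min(1.0, hits / 2)


def thinking_expert(query: str) -> str:
    # Placeholder for the slow, deliberative branch.
    return f"step-by-step reasoning, then an answer to: {query}"


def non_thinking_expert(query: str) -> str:
    # Placeholder for the fast, direct branch.
    return f"direct answer to: {query}"


def hybrid_answer(
    query: str,
    router: Callable[[str], float] = toy_router,
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> RoutedAnswer:
    """Send the query to exactly one branch, based on the router score."""
    if router(query) >= threshold:
        return RoutedAnswer("thinking", thinking_expert(query))
    return RoutedAnswer("non-thinking", non_thinking_expert(query))


if __name__ == "__main__":
    for q in ("Solve for x and calculate the area.", "Read the text on this sign."):
        print(q, "->", hybrid_answer(q).expert)
```

The point of the sketch is only the control flow: one cheap routing decision per query, after which a single branch does the work.
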
<div style="display: flex; justify-content: center; gap: 20px; flex-wrap: wrap;">
<img src="https://raw.githubusercontent.com/MM-Thinking/Metis-HOME/main/assets/framework.png" alt="Metis-HOME framework overview" style="width:400px; max-width:100%;">
<img src="https://raw.githubusercontent.com/MM-Thinking/Metis-HOME/main/assets/radar_chart.png" alt="Metis-HOME radar chart" style="width:400px; max-width:100%;">
</div>

## ✨ Highlights

- 🧠 Hybrid Thinking Paradigm: Explicitly decouples "System 1" (fast, intuitive) and "System 2" (slow, deliberative) reasoning within a unified multimodal MoE architecture.
- 🔄 Router Mechanism: A lightweight, trainable router dispatches each query according to its complexity, so simple tasks such as OCR or captioning do not incur the cost of long reasoning.
- 🚀 Performance:
  - +6.9 points on the reasoning benchmark average (MathVista, etc.) over the baseline (39.2 to 46.1).
  - ~1 point gain on the general benchmark average (70.3 to 71.2), reversing the degradation trend observed in other reasoning-specialized models.
- 🛠️ Efficient Training: A multi-stage strategy that combines Reinforcement Learning (RL) for reasoning enhancement with mixed Supervised Fine-Tuning (SFT) for expert specialization.

## 📊 Results

### Thinking Ratio

The figure below shows the **thinking ratio** of Metis-HOME on each benchmark, that is, how often the thinking expert is used. It reveals adaptive routing behavior:

- **High ratios (78%–98%)** on reasoning-heavy benchmarks (*WeMath*, *MathVision*, etc.), indicating effective use of the *thinking expert* for multi-step inference.
- **Low ratios (2%–5%)** on general benchmarks (*MMBench*, *OCRBench*), showing a preference for the *non-thinking expert*.

This matches our design goal: **deliberate reasoning for complex tasks** and **fast inference for simple ones**, which keeps computation efficient.

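As a small illustration of how such a ratio could be computed, the sketch below assumes the thinking ratio is the percentage of queries in a benchmark that the router sends to the thinking expert. The function name and the routing logs are hypothetical; the logs are made-up inputs, not measured data.

```python
from collections import Counter
from typing import Sequence


def thinking_ratio(decisions: Sequence[str]) -> float:
    """Percentage of routing decisions that chose the thinking expert."""
    if not decisions:
        raise ValueError("no routing decisions given")
    counts = Counter(decisions)
    return 100.0 * counts["thinking"] / len(decisions)


# Made-up routing logs, for illustration only (not results from the paper).
reasoning_log = ["thinking"] * 9 + ["non-thinking"] * 1
general_log = ["thinking"] * 1 + ["non-thinking"] * 19

if __name__ == "__main__":
    print(f"reasoning-style log: {thinking_ratio(reasoning_log):.1f}%")
    print(f"general-style log:   {thinking_ratio(general_log):.1f}%")
```
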
<img src="https://raw.githubusercontent.com/MM-Thinking/Metis-HOME/main/assets/thinking_ratio_chart.png" alt="Metis-HOME thinking ratio chart" style="width:850px; max-width:100%;">

### Benchmarks

<table>
<thead>
<tr>
  <th rowspan="2" style="text-align:left; vertical-align:bottom;">Model</th>
  <th colspan="7" style="text-align:center; border-bottom:1px solid #ccc;">Reasoning</th>
  <th style="text-align:center; border-bottom:1px solid #ccc;">General</th>
</tr>
<tr>
  <th>MathVista</th>
  <th>MathVision</th>
  <th>MathVerse</th>
  <th>DynaMath</th>
  <th>WeMath</th>
  <th>LogicVista</th>
  <th>Avg.</th>
  <th>Avg.</th>
</tr>
</thead>
<tbody>
<tr style="background-color: #e0e0e0;">
  <td colspan="9" align="center"><strong><em>Proprietary Models</em></strong></td>
</tr>
<tr>
  <td>Gemini-2.0-Pro</td>
  <td>71.3</td>
  <td>48.1</td>
  <td>67.3</td>
  <td>43.3</td>
  <td>56.5</td>
  <td>53.2</td>
  <td>56.6</td>
  <td>73.3</td>
</tr>
<tr>
  <td>Gemini-2.0-Flash</td>
  <td>70.4</td>
  <td>43.6</td>
  <td>47.8</td>
  <td>42.1</td>
  <td>47.4</td>
  <td>52.3</td>
  <td>50.6</td>
  <td>72.6</td>
</tr>
<tr>
  <td>Claude 3.7 Sonnet</td>
  <td>66.8</td>
  <td>41.9</td>
  <td>46.7</td>
  <td>39.7</td>
  <td>49.3</td>
  <td>58.2</td>
  <td>50.4</td>
  <td>70.1</td>
</tr>
<tr>
  <td>ChatGPT-4o</td>
  <td>60.0</td>
  <td>31.2</td>
  <td>40.6</td>
  <td>34.5</td>
  <td>45.8</td>
  <td>52.8</td>
  <td>44.2</td>
  <td>72.0</td>
</tr>
<tr style="background-color: #e0e0e0;">
  <td colspan="9" align="center"><strong><em>Open-source Models</em></strong></td>
</tr>
<tr>
  <td>LLaVA-OneVision-72B</td>
  <td>67.1</td>
  <td>25.3</td>
  <td>27.2</td>
  <td>15.6</td>
  <td>32.0</td>
  <td>40.9</td>
  <td>34.7</td>
  <td>68.0</td>
</tr>
<tr>
  <td>Kimi-VL-A3B-Instruct</td>
  <td>66.0</td>
  <td>21.8</td>
  <td>34.1</td>
  <td>18.0</td>
  <td>32.3</td>
  <td>42.7</td>
  <td>35.8</td>
  <td>69.1</td>
</tr>
<tr>
  <td>InternVL3-8B</td>
  <td>70.5</td>
  <td>30.0</td>
  <td>38.5</td>
  <td>25.7</td>
  <td>39.5</td>
  <td>44.5</td>
  <td>41.4</td>
  <td>73.6</td>
</tr>
<tr>
  <td>VL-Rethinker-7B</td>
  <td>75.5</td>
  <td>29.3</td>
  <td>47.2</td>
  <td>25.4</td>
  <td>37.8</td>
  <td>47.0</td>
  <td>43.7</td>
  <td>68.3</td>
</tr>
<tr>
  <td>Metis-RISE-7B</td>
  <td>75.8</td>
  <td>28.7</td>
  <td>51.0</td>
  <td>27.7</td>
  <td>45.2</td>
  <td>49.7</td>
  <td>46.4</td>
  <td>68.4</td>
</tr>
<tr>
  <td style="border-top: 1px solid #000;">Baseline</td>
  <td style="border-top: 1px solid #000;">67.4</td>
  <td style="border-top: 1px solid #000;">26.2</td>
  <td style="border-top: 1px solid #000;">41.1</td>
  <td style="border-top: 1px solid #000;">20.2</td>
  <td style="border-top: 1px solid #000;">34.5</td>
  <td style="border-top: 1px solid #000;">45.6</td>
  <td style="border-top: 1px solid #000; background-color: #fff2cc;">39.2</td>
  <td style="border-top: 1px solid #000;">70.3</td>
</tr>
<tr>
  <td>Baseline+RL</td>
  <td>72.8</td>
  <td>28.7</td>
  <td>46.8</td>
  <td>26.2</td>
  <td>43.3</td>
  <td>46.5</td>
  <td>44.0</td>
  <td style="background-color: #e1d5e7;">67.2</td>
</tr>
<tr>
  <td><b>Metis-HOME</b></td>
  <td>76.0</td>
  <td>29.5</td>
  <td>47.7</td>
  <td>26.4</td>
  <td>45.6</td>
  <td>51.5</td>
  <td style="background-color: #fff2cc;"><b>46.1</b></td>
  <td style="background-color: #e1d5e7;"><b>71.2</b></td>
</tr>
</tbody>
</table>
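
The headline gains can be checked against the table with a few lines of Python. The sketch below copies the Baseline and Metis-HOME rows and assumes that each reasoning "Avg." is the plain mean of the six reasoning benchmarks, rounded to one decimal. The variable and function names are illustrative only.

```python
# Scores copied from the Baseline and Metis-HOME rows of the table above.
BENCHMARKS = ["MathVista", "MathVision", "MathVerse", "DynaMath", "WeMath", "LogicVista"]
baseline = [67.4, 26.2, 41.1, 20.2, 34.5, 45.6]
metis_home = [76.0, 29.5, 47.7, 26.4, 45.6, 51.5]
general_avg = {"Baseline": 70.3, "Metis-HOME": 71.2}


def mean(scores):
    """Arithmetic mean of a non-empty list of scores."""
    return sum(scores) / len(scores)


# Assumption: the reported averages are plain means rounded to one decimal.
avg_baseline = round(mean(baseline), 1)
avg_metis_home = round(mean(metis_home), 1)

# Gains in percentage points, computed from the rounded averages as reported.
reasoning_gain = round(avg_metis_home - avg_baseline, 1)
general_gain = round(general_avg["Metis-HOME"] - general_avg["Baseline"], 1)

# Per-benchmark gains of Metis-HOME over the baseline.
per_benchmark_gain = {
    name: round(new - old, 1)
    for name, old, new in zip(BENCHMARKS, baseline, metis_home)
}

if __name__ == "__main__":
    print("reasoning avg:", avg_baseline, "->", avg_metis_home, f"(+{reasoning_gain})")
    print("general avg gain:", general_gain)
    print("per benchmark:", per_benchmark_gain)
```

The gains here are absolute differences in benchmark score (percentage points), not relative improvements.
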

## 🔍 Usage Example

You can run the demo inference script in the `examples` folder:

```bash
python examples/demo_inference.py
```

## 📌 Acknowledgement

We sincerely thank [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) and [MM-EUREKA](https://github.com/ModalMinds/MM-EUREKA) for providing the reference training frameworks.

## 📖 Citation

```bibtex
@article{lan2025metis,
  title={Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning},
  author={Lan, Xiaohan and Liu, Fanfan and Qiu, Haibo and Yang, Siqi and Ruan, Delian and Shi, Peng and Ma, Lin},
  journal={arXiv preprint arXiv:2510.20519},
  year={2025}
}
```