---
license: apache-2.0
---
<h1 align="center">Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning</h1>
<h5 align="center">
[![arXiv](https://img.shields.io/badge/Arxiv-2510.20519-b31b1b.svg?logo=arXiv)](https://arxiv.org/pdf/2510.20519)&ensp;<a href='https://huggingface.co/mmthinking/Metis-HOME'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face%20-models-blue'></a>&ensp;[![Code License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/LICENSE)
</h5>
## 💡 Overview
Current multimodal reasoning models face a critical dilemma: they often "overthink" on simple tasks (inefficiency) and suffer from general capability degradation when optimized for reasoning.
We introduce **Metis-HOME** (**H**ybrid **O**ptimized **M**ixture-of-**E**xperts), a novel framework that enables a "Hybrid Thinking" paradigm. By structuring the original dense model (Qwen2.5-VL-7B) into two distinct expert branches—a Thinking Expert for complex reasoning and a Non-Thinking Expert for rapid inference—controlled by a lightweight router, Metis-HOME effectively resolves the reasoning-vs-generalization trade-off.
<div style="display: flex; justify-content: center; gap: 20px; flex-wrap: wrap;">
<img src="https://raw.githubusercontent.com/MM-Thinking/Metis-HOME/main/assets/framework.png" alt="Metis-HOME Framework Overview" style="width:400px; max-width:100%;">
<img src="https://raw.githubusercontent.com/MM-Thinking/Metis-HOME/main/assets/radar_chart.png" alt="Metis-HOME Benchmark Radar Chart" style="width:400px; max-width:100%;">
</div>
## ✨ Highlights
- 🧠 Hybrid Thinking Paradigm: Explicitly decouples "System 1" (fast, intuitive) and "System 2" (slow, deliberative) reasoning within a unified multimodal MoE architecture.
- 🔄 Router Mechanism: A lightweight, trainable router dynamically allocates queries based on complexity, avoiding computational waste on simple tasks like OCR or Captioning.
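- 🚀 Performance:
  - +6.9 points on the average over six reasoning benchmarks (MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista) compared to the baseline (39.2 → 46.1).
  - ~1 point gain on the general benchmark average (70.3 → 71.2), reversing the degradation observed in other reasoning-specialized models (the RL-only variant drops to 67.2).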
- 🛠️ Efficient Training: A multi-stage strategy combining Reinforcement Learning (RL) for reasoning enhancement and Mixed Supervised Fine-Tuning (SFT) for expert specialization.
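
The routing idea in the highlights above can be illustrated with a minimal sketch. It assumes a binary router that scores a query and sends it to exactly one of two experts. The linear scorer, the two hand-made features, the weights, and the 0.5 threshold are all hypothetical choices made for illustration; the trained Metis-HOME router and its inputs are not described here and may work differently.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass
class HybridRouter:
    """Toy binary router (illustrative only, not the trained Metis-HOME router)."""

    weights: Sequence[float]
    bias: float = 0.0
    threshold: float = 0.5  # assumed decision threshold

    def think_probability(self, features: Sequence[float]) -> float:
        # Linear score followed by a sigmoid: P(route to the thinking expert).
        score = sum(w * x for w, x in zip(self.weights, features)) + self.bias
        return 1.0 / (1.0 + math.exp(-score))

    def route(self, features: Sequence[float]) -> str:
        if self.think_probability(features) >= self.threshold:
            return "thinking"
        return "non_thinking"


def answer(
    query: str,
    features: Sequence[float],
    router: HybridRouter,
    experts: Dict[str, Callable[[str], str]],
) -> Tuple[str, str]:
    # Hard routing: only the selected expert runs, so simple queries
    # never pay for the slower thinking branch.
    branch = router.route(features)
    return branch, experts[branch](query)


# Hypothetical features: [needs_multi_step_reasoning, is_plain_perception]
router = HybridRouter(weights=[2.0, -1.0], bias=-0.5)
experts = {
    "thinking": lambda q: f"slow:{q}",      # stand-in for the Thinking Expert
    "non_thinking": lambda q: f"fast:{q}",  # stand-in for the Non-Thinking Expert
}

print(answer("Solve the geometry problem", [1.0, 0.0], router, experts)[0])  # thinking
print(answer("Read the text in the image", [0.0, 1.0], router, experts)[0])  # non_thinking
```

The sketch uses a hard, per-query decision rather than a weighted mixture of expert outputs, because the stated goal is to avoid spending computation on the thinking branch for simple queries.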
## 📊 Results
### Thinking Ratio
As shown in the following figure, the **thinking ratio** analysis of Metis-HOME reveals adaptive routing behavior:
- **High ratios (78\%–98\%)** on reasoning-heavy benchmarks (*WeMath*, *MathVision*, etc.), indicating effective use of the *thinking expert* for multi-step inference.
- **Low ratios (2\%–5\%)** on general benchmarks (*MMBench*, *OCRBench*), showing preference for the *non-thinking expert*.
This aligns with our design: **deliberate reasoning for complex tasks**, **fast inference for simple ones**, optimizing computational efficiency.
<img src="https://raw.githubusercontent.com/MM-Thinking/Metis-HOME/main/assets/thinking_ratio_chart.png" alt="Metis-HOME Thinking Ratio per Benchmark" style="width:850px; max-width:100%;">
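
For readers who want to reproduce this kind of analysis, the thinking ratio can be read as the fraction of a benchmark's queries that the router sends to the thinking expert. The sketch below assumes that definition and assumes routing decisions are logged as the strings `"thinking"` and `"non_thinking"`; both the log format and the sample decisions are hypothetical, not taken from the released code or results.

```python
from typing import Iterable


def thinking_ratio(decisions: Iterable[str]) -> float:
    """Fraction of queries routed to the thinking expert."""
    decisions = list(decisions)
    if not decisions:
        raise ValueError("no routing decisions to summarize")
    return sum(d == "thinking" for d in decisions) / len(decisions)


# Made-up routing log for ten queries (illustrative data only).
sample_log = ["thinking"] * 9 + ["non_thinking"]
print(f"{thinking_ratio(sample_log):.0%}")  # 90%
```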
### Benchmarks
<table>
<thead>
<tr>
<th rowspan="2" style="text-align:left; vertical-align:bottom;">Model</th>
<th colspan="7" style="text-align:center; border-bottom:1px solid #ccc;">Reasoning</th>
<th style="text-align:center; border-bottom:1px solid #ccc;">General</th>
</tr>
<tr>
<th>MathVista</th>
<th>MathVision</th>
<th>MathVerse</th>
<th>DynaMath</th>
<th>WeMath</th>
<th>LogicVista</th>
<th>Avg.</th>
<th>Avg.</th>
</tr>
</thead>
<tbody>
<tr style="background-color: #e0e0e0;">
<td colspan="9" align="center"><strong><em>Proprietary Models</em></strong></td>
</tr>
<tr>
<td>Gemini-2.0-Pro</td>
<td>71.3</td>
<td>48.1</td>
<td>67.3</td>
<td>43.3</td>
<td>56.5</td>
<td>53.2</td>
<td>56.6</td>
<td>73.3</td>
</tr>
<tr>
<td>Gemini-2.0-Flash</td>
<td>70.4</td>
<td>43.6</td>
<td>47.8</td>
<td>42.1</td>
<td>47.4</td>
<td>52.3</td>
<td>50.6</td>
<td>72.6</td>
</tr>
<tr>
<td>Claude 3.7 Sonnet</td>
<td>66.8</td>
<td>41.9</td>
<td>46.7</td>
<td>39.7</td>
<td>49.3</td>
<td>58.2</td>
<td>50.4</td>
<td>70.1</td>
</tr>
<tr>
<td>ChatGPT-4o</td>
<td>60.0</td>
<td>31.2</td>
<td>40.6</td>
<td>34.5</td>
<td>45.8</td>
<td>52.8</td>
<td>44.2</td>
<td>72.0</td>
</tr>
<tr style="background-color: #e0e0e0;">
<td colspan="9" align="center"><strong><em>Open-source Models</em></strong></td>
</tr>
<tr>
<td>LLaVA-OneVision-72B</td>
<td>67.1</td>
<td>25.3</td>
<td>27.2</td>
<td>15.6</td>
<td>32.0</td>
<td>40.9</td>
<td>34.7</td>
<td>68.0</td>
</tr>
<tr>
<td>Kimi-VL-A3B-Instruct</td>
<td>66.0</td>
<td>21.8</td>
<td>34.1</td>
<td>18.0</td>
<td>32.3</td>
<td>42.7</td>
<td>35.8</td>
<td>69.1</td>
</tr>
<tr>
<td>InternVL3-8B</td>
<td>70.5</td>
<td>30.0</td>
<td>38.5</td>
<td>25.7</td>
<td>39.5</td>
<td>44.5</td>
<td>41.4</td>
<td>73.6</td>
</tr>
<tr>
<td>VL-Rethinker-7B</td>
<td>75.5</td>
<td>29.3</td>
<td>47.2</td>
<td>25.4</td>
<td>37.8</td>
<td>47.0</td>
<td>43.7</td>
<td>68.3</td>
</tr>
<tr>
<td>Metis-RISE-7B</td>
<td>75.8</td>
<td>28.7</td>
<td>51.0</td>
<td>27.7</td>
<td>45.2</td>
<td>49.7</td>
<td>46.4</td>
<td>68.4</td>
</tr>
<tr>
<td style="border-top: 1px solid #000;">Baseline</td>
<td style="border-top: 1px solid #000;">67.4</td>
<td style="border-top: 1px solid #000;">26.2</td>
<td style="border-top: 1px solid #000;">41.1</td>
<td style="border-top: 1px solid #000;">20.2</td>
<td style="border-top: 1px solid #000;">34.5</td>
<td style="border-top: 1px solid #000;">45.6</td>
<td style="border-top: 1px solid #000; background-color: #fff2cc;">39.2</td>
<td style="border-top: 1px solid #000;">70.3</td>
</tr>
<tr>
<td>Baseline+RL</td>
<td>72.8</td>
<td>28.7</td>
<td>46.8</td>
<td>26.2</td>
<td>43.3</td>
<td>46.5</td>
<td>44.0</td>
<td style="background-color: #e1d5e7;">67.2</td>
</tr>
<tr>
<td><b>Metis-HOME</b></td>
<td>76.0</td>
<td>29.5</td>
<td>47.7</td>
<td>26.4</td>
<td>45.6</td>
<td>51.5</td>
<td style="background-color: #fff2cc;"><b>46.1</b></td>
<td style="background-color: #e1d5e7;"><b>71.2</b></td>
</tr>
</tbody>
</table>
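
As a quick sanity check on the table, the snippet below recomputes the reasoning `Avg.` column and the gains over the baseline. It assumes the reasoning average is the unweighted mean of the six benchmark scores, which is consistent with the reported numbers; the scores are copied from the Baseline and Metis-HOME rows above.

```python
# Scores in column order: MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista
reasoning = {
    "Baseline": [67.4, 26.2, 41.1, 20.2, 34.5, 45.6],
    "Metis-HOME": [76.0, 29.5, 47.7, 26.4, 45.6, 51.5],
}
general = {"Baseline": 70.3, "Metis-HOME": 71.2}


def mean(scores):
    return sum(scores) / len(scores)


# Round to one decimal place, matching the precision used in the table.
reasoning_avg = {name: round(mean(s), 1) for name, s in reasoning.items()}
reasoning_gain = round(reasoning_avg["Metis-HOME"] - reasoning_avg["Baseline"], 1)
general_gain = round(general["Metis-HOME"] - general["Baseline"], 1)

print(reasoning_avg)   # {'Baseline': 39.2, 'Metis-HOME': 46.1}
print(reasoning_gain)  # 6.9
print(general_gain)    # 0.9
```

Both gains are absolute differences in benchmark score (percentage points), not relative improvements.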
## 🔍 Usage Example
You can use the demo inference script in the `examples` folder:
```bash
python examples/demo_inference.py
```
## 📌 Acknowledgement
We sincerely thank [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) and [MM-EUREKA](https://github.com/ModalMinds/MM-EUREKA) for providing the training frameworks we used as references.
## 📖 Citation
```bibtex
@article{lan2025metis,
title={Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning},
author={Lan, Xiaohan and Liu, Fanfan and Qiu, Haibo and Yang, Siqi and Ruan, Delian and Shi, Peng and Ma, Lin},
journal={arXiv preprint arXiv:2510.20519},
year={2025}
}
```