---
license: apache-2.0
---
<h1 align="center">Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning</h1>
<h5 align="center">
<a href='https://arxiv.org/pdf/2510.20519'><img src='https://img.shields.io/badge/arXiv-2510.20519-red'></a> <a href='https://huggingface.co/mmthinking/Metis-HOME'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face%20-models-blue'></a> <a href='https://github.com/tatsu-lab/stanford_alpaca/blob/main/LICENSE'><img src='https://img.shields.io/badge/License-Apache%202.0-green'></a>
</h5>
## 💡 Overview
Current multimodal reasoning models face a critical dilemma: they often "overthink" on simple tasks (inefficiency) and suffer from general capability degradation when optimized for reasoning.
We introduce **Metis-HOME** (**H**ybrid **O**ptimized **M**ixture-of-**E**xperts), a novel framework that enables a "Hybrid Thinking" paradigm. By structuring the original dense model (Qwen2.5-VL-7B) into two distinct expert branches—a Thinking Expert for complex reasoning and a Non-Thinking Expert for rapid inference—controlled by a lightweight router, Metis-HOME effectively resolves the reasoning-vs-generalization trade-off.
<div style="display: flex; justify-content: center; gap: 20px; flex-wrap: wrap;">
<img src="https://raw.githubusercontent.com/MM-Thinking/Metis-HOME/main/assets/framework.png" alt="Metis-RISE Framework Overview" style="width:400px; max-width:100%;">
<img src="https://raw.githubusercontent.com/MM-Thinking/Metis-HOME/main/assets/radar_chart.png" alt="Metis-RISE Framework Overview" style="width:400px; max-width:100%;">
</div>
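To make the architecture concrete, the sketch below shows one way such a hybrid layer could be wired up. This is an illustration only, not the released implementation: the class name, the simplified SiLU MLP (the real Qwen2.5-VL FFN is a gated SwiGLU block), and the mean-pooled, query-level routing are all assumptions.

```python
import torch
import torch.nn as nn

class HybridMoELayer(nn.Module):
    """Toy hybrid layer: a thinking and a non-thinking expert branch, both
    stand-ins for copies of the original dense FFN, gated by a tiny router."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()

        def make_expert():
            # Simplified MLP stand-in for the dense model's FFN block.
            return nn.Sequential(
                nn.Linear(hidden_size, intermediate_size),
                nn.SiLU(),
                nn.Linear(intermediate_size, hidden_size),
            )

        self.thinking_expert = make_expert()
        self.non_thinking_expert = make_expert()
        # Lightweight router: 2 logits (non-thinking vs. thinking) per query.
        self.router = nn.Linear(hidden_size, 2)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Route once per sequence from a pooled summary, so the whole query
        # commits to a single mode instead of mixing experts token by token.
        pooled = hidden_states.mean(dim=1)            # (batch, hidden)
        choice = self.router(pooled).argmax(dim=-1)   # (batch,)
        out = torch.empty_like(hidden_states)
        for i, pick in enumerate(choice):
            expert = self.thinking_expert if pick == 1 else self.non_thinking_expert
            out[i] = expert(hidden_states[i])
        return out

# Smoke test on random activations.
layer = HybridMoELayer(hidden_size=64, intermediate_size=256)
print(layer(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```

Routing at the query level (rather than per token) is what lets an entire response commit to either a deliberative or a fast answer style.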
## ✨ Highlights
- 🧠 Hybrid Thinking Paradigm: Explicitly decouples "System 1" (fast, intuitive) and "System 2" (slow, deliberative) reasoning within a unified multimodal MoE architecture.
- 🔄 Router Mechanism: A lightweight, trainable router dynamically allocates queries based on complexity, avoiding computational waste on simple tasks like OCR or Captioning.
- 🚀 Performance:
  - **+6.9 points** average gain across six reasoning benchmarks (MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista) over the baseline (39.2 → 46.1).
  - **~1 point** gain on general benchmarks (70.3 → 71.2), reversing the degradation trend observed in other reasoning-specialized models.
- 🛠️ Efficient Training: A multi-stage strategy combining Reinforcement Learning (RL) for reasoning enhancement with Mixed Supervised Fine-Tuning (SFT) for expert specialization; a sketch of what mixed-SFT data could look like follows this list.
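The mixed-SFT stage trains each branch on data matching its mode: thinking samples carry an explicit reasoning trace, non-thinking samples answer directly. Purely as a hypothetical illustration (field names, tags, and samples are invented here, not the released data schema), a mixed batch might look like:

```python
# Hypothetical mixed-SFT samples: one deliberative, one direct-answer.
mixed_sft_batch = [
    {
        "image": "geometry_problem.png",
        "question": "What is the area of the shaded region?",
        "mode": "thinking",
        "response": "<think>The square has side 4, so area 16; the inscribed "
                    "circle has radius 2, so area 4*pi. Shaded = 16 - 4*pi."
                    "</think> 16 - 4*pi",
    },
    {
        "image": "street_sign.png",
        "question": "What does the sign say?",
        "mode": "non_thinking",
        "response": "STOP",
    },
]

for sample in mixed_sft_batch:
    print(sample["mode"], "->", sample["response"][:40])
```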
## 📊 Results
### Thinking Ratio
As shown in the following figure, the **thinking ratio** analysis of Metis-HOME reveals adaptive routing behavior:
- **High ratios (78%–98%)** on reasoning-heavy benchmarks (*WeMath*, *MathVision*, etc.), indicating effective use of the *thinking expert* for multi-step inference.
- **Low ratios (2%–5%)** on general benchmarks (*MMBench*, *OCRBench*), showing a preference for the *non-thinking expert*.

This aligns with our design: **deliberate reasoning for complex tasks**, **fast inference for simple ones**, which optimizes computational efficiency.
<img src="https://raw.githubusercontent.com/MM-Thinking/Metis-HOME/main/assets/thinking_ratio_chart.png" alt="Metis-RISE Framework Overview" style="width:850px; max-width:100%;">
### Benchmarks
<table>
<thead>
<tr>
<th rowspan="2" style="text-align:left; vertical-align:bottom;">Model</th>
<th colspan="7" style="text-align:center; border-bottom:1px solid #ccc;">Reasoning</th>
<th style="text-align:center; border-bottom:1px solid #ccc;">General</th>
</tr>
<tr>
<th>MathVista</th>
<th>MathVision</th>
<th>MathVerse</th>
<th>DynaMath</th>
<th>WeMath</th>
<th>LogicVista</th>
<th>Avg.</th>
<th>Avg.</th>
</tr>
</thead>
<tbody>
<tr style="background-color: #e0e0e0;">
<td colspan="9" align="center"><strong><em>Proprietary Models</em></strong></td>
</tr>
<tr>
<td>Gemini-2.0-Pro</td>
<td>71.3</td>
<td>48.1</td>
<td>67.3</td>
<td>43.3</td>
<td>56.5</td>
<td>53.2</td>
<td>56.6</td>
<td>73.3</td>
</tr>
<tr>
<td>Gemini-2.0-Flash</td>
<td>70.4</td>
<td>43.6</td>
<td>47.8</td>
<td>42.1</td>
<td>47.4</td>
<td>52.3</td>
<td>50.6</td>
<td>72.6</td>
</tr>
<tr>
<td>Claude 3.7 Sonnet</td>
<td>66.8</td>
<td>41.9</td>
<td>46.7</td>
<td>39.7</td>
<td>49.3</td>
<td>58.2</td>
<td>50.4</td>
<td>70.1</td>
</tr>
<tr>
<td>ChatGPT-4o</td>
<td>60.0</td>
<td>31.2</td>
<td>40.6</td>
<td>34.5</td>
<td>45.8</td>
<td>52.8</td>
<td>44.2</td>
<td>72.0</td>
</tr>
<tr style="background-color: #e0e0e0;">
<td colspan="9" align="center"><strong><em>Open-source Models</em></strong></td>
</tr>
<tr>
<td>LLaVA-OneVision-72B</td>
<td>67.1</td>
<td>25.3</td>
<td>27.2</td>
<td>15.6</td>
<td>32.0</td>
<td>40.9</td>
<td>34.7</td>
<td>68.0</td>
</tr>
<tr>
<td>Kimi-VL-A3B-Instruct</td>
<td>66.0</td>
<td>21.8</td>
<td>34.1</td>
<td>18.0</td>
<td>32.3</td>
<td>42.7</td>
<td>35.8</td>
<td>69.1</td>
</tr>
<tr>
<td>InternVL3-8B</td>
<td>70.5</td>
<td>30.0</td>
<td>38.5</td>
<td>25.7</td>
<td>39.5</td>
<td>44.5</td>
<td>41.4</td>
<td>73.6</td>
</tr>
<tr>
<td>VL-Rethinker-7B</td>
<td>75.5</td>
<td>29.3</td>
<td>47.2</td>
<td>25.4</td>
<td>37.8</td>
<td>47.0</td>
<td>43.7</td>
<td>68.3</td>
</tr>
<tr>
<td>Metis-RISE-7B</td>
<td>75.8</td>
<td>28.7</td>
<td>51.0</td>
<td>27.7</td>
<td>45.2</td>
<td>49.7</td>
<td>46.4</td>
<td>68.4</td>
</tr>
<tr>
<td style="border-top: 1px solid #000;">Baseline</td>
<td style="border-top: 1px solid #000;">67.4</td>
<td style="border-top: 1px solid #000;">26.2</td>
<td style="border-top: 1px solid #000;">41.1</td>
<td style="border-top: 1px solid #000;">20.2</td>
<td style="border-top: 1px solid #000;">34.5</td>
<td style="border-top: 1px solid #000;">45.6</td>
<td style="border-top: 1px solid #000; background-color: #fff2cc;">39.2</td>
<td style="border-top: 1px solid #000;">70.3</td>
</tr>
<tr>
<td>Baseline+RL</td>
<td>72.8</td>
<td>28.7</td>
<td>46.8</td>
<td>26.2</td>
<td>43.3</td>
<td>46.5</td>
<td>44.0</td>
<td style="background-color: #e1d5e7;">67.2</td>
</tr>
<tr>
<td><b>Metis-HOME</b></td>
<td>76.0</td>
<td>29.5</td>
<td>47.7</td>
<td>26.4</td>
<td>45.6</td>
<td>51.5</td>
<td style="background-color: #fff2cc;"><b>46.1</b></td>
<td style="background-color: #e1d5e7;"><b>71.2</b></td>
</tr>
</tbody>
</table>
## 🔍 Usage Example
You can use the demo inference script in the `examples` folder:
```bash
python examples/demo_inference.py
```
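For orientation, here is a minimal sketch of what such a script typically does. It assumes the checkpoint loads through the stock Qwen2.5-VL classes in `transformers`; given the hybrid-MoE modification, loading may instead require `trust_remote_code=True` or the repository's custom loader, and the image path is a placeholder, so treat `examples/demo_inference.py` as the authoritative reference.

```python
# Minimal inference sketch; assumes compatibility with stock Qwen2.5-VL classes.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "mmthinking/Metis-HOME"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/or/url/to/problem.png"},  # placeholder
        {"type": "text", "text": "What is the area of the shaded region?"},
    ],
}]

# Standard Qwen2.5-VL preprocessing; the router inside the model decides
# whether the query runs through the thinking or the non-thinking branch.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=1024)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```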
## 📌 Acknowledgement
We sincerely appreciate [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) and [MM-EUREKA](https://github.com/ModalMinds/MM-EUREKA) for providing the reference training frameworks.
## 📖 Citation
```bibtex
@article{lan2025metis,
  title={Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning},
  author={Lan, Xiaohan and Liu, Fanfan and Qiu, Haibo and Yang, Siqi and Ruan, Delian and Shi, Peng and Ma, Lin},
  journal={arXiv preprint arXiv:2510.20519},
  year={2025}
}
```