---
license: apache-2.0
---

# Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning

[![arXiv](https://img.shields.io/badge/Arxiv-2510.20519-b31b1b.svg?logo=arXiv)](https://arxiv.org/pdf/2510.20519)  [![Code License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/LICENSE)
## 💡 Overview

Current multimodal reasoning models face a critical dilemma: they often "overthink" simple tasks (inefficiency) and suffer general capability degradation when optimized for reasoning. We introduce **Metis-HOME** (**H**ybrid **O**ptimized **M**ixture-of-**E**xperts), a novel framework that enables a "Hybrid Thinking" paradigm. By restructuring the original dense model (Qwen2.5-VL-7B) into two distinct expert branches, a Thinking Expert for complex reasoning and a Non-Thinking Expert for rapid inference, controlled by a lightweight router, Metis-HOME effectively resolves the reasoning-vs-generalization trade-off.
*Figure: framework overview.*
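The routing described above (a lightweight gate dispatching each query to exactly one of two expert branches) can be sketched in a few lines. This is a minimal, illustrative hard top-1 router over pooled query features; the function names, feature dimension, and logistic gate are assumptions for the sketch, not the released implementation.

```python
import numpy as np

def thinking_expert(x):
    # stands in for the slow, deliberative "System 2" branch
    return "thinking"

def non_thinking_expert(x):
    # stands in for the fast, intuitive "System 1" branch
    return "non-thinking"

def route(x, w, b=0.0):
    """Lightweight router: a single logistic gate on pooled features.

    Queries gated above 0.5 go to the thinking expert; the rest take
    the fast non-thinking path, so only one branch runs per query.
    """
    p_think = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return thinking_expert(x) if p_think > 0.5 else non_thinking_expert(x)

# toy example: random pooled features and router weights
rng = np.random.default_rng(0)
w = rng.normal(size=8)
x = rng.normal(size=8)
print(route(x, w))
```

Hard (top-1) routing is what makes the efficiency argument work: a simple OCR query never pays the cost of long deliberative decoding.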
## ✨ Highlights

- 🧠 **Hybrid Thinking Paradigm**: Explicitly decouples "System 1" (fast, intuitive) and "System 2" (slow, deliberative) reasoning within a unified multimodal MoE architecture.
- 🔄 **Router Mechanism**: A lightweight, trainable router dynamically allocates queries based on complexity, avoiding computational waste on simple tasks like OCR or captioning.
- 🚀 **Performance**:
  - +6.9% improvement on reasoning benchmarks (MathVista, etc.) over the baseline.
  - ~1% gain on general benchmarks, reversing the degradation trend observed in other reasoning-specialized models.
- 🛠️ **Efficient Training**: A multi-stage strategy combining Reinforcement Learning (RL) for reasoning enhancement and Mixed Supervised Fine-Tuning (SFT) for expert specialization.

## 📊 Results

### Thinking Ratio

As shown in the following figure, the **thinking ratio** analysis of Metis-HOME reveals adaptive routing behavior:

- **High ratios (78%–98%)** on reasoning-heavy benchmarks (*WeMath*, *MathVision*, etc.), indicating effective use of the *thinking expert* for multi-step inference.
- **Low ratios (2%–5%)** on general benchmarks (*MMBench*, *OCRBench*), showing a preference for the *non-thinking expert*.

This aligns with our design: **deliberate reasoning for complex tasks, fast inference for simple ones**, optimizing computational efficiency.

*Figure: thinking ratio across benchmarks.*

### Benchmarks
| Model | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Reasoning Avg. | General Avg. |
|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | | |
| Gemini-2.0-Pro | 71.3 | 48.1 | 67.3 | 43.3 | 56.5 | 53.2 | 56.6 | 73.3 |
| Gemini-2.0-Flash | 70.4 | 43.6 | 47.8 | 42.1 | 47.4 | 52.3 | 50.6 | 72.6 |
| Claude 3.7 Sonnet | 66.8 | 41.9 | 46.7 | 39.7 | 49.3 | 58.2 | 50.4 | 70.1 |
| ChatGPT-4o | 60.0 | 31.2 | 40.6 | 34.5 | 45.8 | 52.8 | 44.2 | 72.0 |
| **Open-source Models** | | | | | | | | |
| LLaVA-OneVision-72B | 67.1 | 25.3 | 27.2 | 15.6 | 32.0 | 40.9 | 34.7 | 68.0 |
| Kimi-VL-A3B-Instruct | 66.0 | 21.8 | 34.1 | 18.0 | 32.3 | 42.7 | 35.8 | 69.1 |
| InternVL3-8B | 70.5 | 30.0 | 38.5 | 25.7 | 39.5 | 44.5 | 41.4 | 73.6 |
| VL-Rethinker-7B | 75.5 | 29.3 | 47.2 | 25.4 | 37.8 | 47.0 | 43.7 | 68.3 |
| Metis-RISE-7B | 75.8 | 28.7 | 51.0 | 27.7 | 45.2 | 49.7 | 46.4 | 68.4 |
| Baseline | 67.4 | 26.2 | 41.1 | 20.2 | 34.5 | 45.6 | 39.2 | 70.3 |
| Baseline+RL | 72.8 | 28.7 | 46.8 | 26.2 | 43.3 | 46.5 | 44.0 | 67.2 |
| **Metis-HOME** | 76.0 | 29.5 | 47.7 | 26.4 | 45.6 | 51.5 | 46.1 | 71.2 |
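The thinking ratio reported above is simply the fraction of queries the router sends to the thinking expert on a given benchmark. A toy computation, with made-up routing decisions rather than the paper's data:

```python
# Hypothetical per-benchmark routing decisions (placeholder data).
decisions = {
    "WeMath": ["thinking"] * 49 + ["non-thinking"] * 1,
    "OCRBench": ["thinking"] * 2 + ["non-thinking"] * 48,
}

def thinking_ratio(routes):
    """Fraction of queries routed to the thinking expert."""
    return sum(r == "thinking" for r in routes) / len(routes)

for bench, routes in decisions.items():
    print(f"{bench}: {100 * thinking_ratio(routes):.0f}%")
```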
## 🔍 Usage Example

You can run the demo inference script in the `examples` folder:

```bash
python examples/demo_inference.py
```

## 📌 Acknowledgement

We sincerely appreciate [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) and [MM-EUREKA](https://github.com/ModalMinds/MM-EUREKA) for providing reference training frameworks.

## 📖 Citation

```bibtex
@article{lan2025metis,
  title={Metis-HOME: Hybrid Optimized Mixture-of-Experts for Multimodal Reasoning},
  author={Lan, Xiaohan and Liu, Fanfan and Qiu, Haibo and Yang, Siqi and Ruan, Delian and Shi, Peng and Ma, Lin},
  journal={arXiv preprint arXiv:2510.20519},
  year={2025}
}
```