| | --- |
| | base_model: |
| | - XiaomiMiMo/MiMo-Embodied |
| | library_name: transformers |
| | license: mit |
| | --- |
| | |
| | <div align="center"> |
| | <img src="./assets/xfmlogo.svg" width=600> |
| | </div> |
| |
|
| | <br/> |
| |
|
| | <div align="center" style="line-height: 1;"> |
| | | |
| | <a href="https://huggingface.co/XiaomiMiMo/MiMo-Embodied-7B" target="_blank">🤗 HuggingFace</a> |
| | | |
| | <a href="https://arxiv.org/abs/2511.16518" target="_blank">📔 Technical Report</a> |
| | | |
| | <br/> |
| | </div> |
| |
|
| | ## I. Introduction |
| |
|
| | **MiMo-Embodied**, a powerful cross-embodied vision-language model that shows state-of-the-art performance in both **autonomous driving** and **embodied AI tasks**, the first open-source VLM that integrates these two critical areas, significantly enhancing understanding and reasoning in dynamic physical environments. |
| |
|
| | <div align="center"> |
| | <img src="./assets/fig1.svg" width=800> |
| | </div> |
| |
|
| |
|
| | ## II. Model Capabilities |
| |
|
| | <div align="center"> |
| | <img src="./assets/fig2.svg" width=800> |
| | </div> |
| |
|
| | ## III. Model Details |
| |
|
| | <div align="center"> |
| | <img src="./assets/fig3_img.png" width=800> |
| | </div> |
| |
|
| | ## IV. Evaluation Results |
| |
|
| | MiMo-Embodied demonstrates superior performance across **17 benchmarks in three key embodied AI capabilities: Task Planning, Affordance Prediction, and Spatial Understanding**, significantly surpassing existing open-source embodied VLM models and rivaling closed-source models. |
| |
|
| | Additionally, MiMo-Embodied excels in **12 autonomous driving benchmarks across three key capabilities: Environmental Perception, Status Prediction, and Driving Planning**—significantly outperforming both existing open-source and closed-source VLM models, as well as proprietary VLM models. |
| |
|
| | Moreover, evaluation on **8 general visual understanding benchmarks** confirms that MiMo-Embodied retains and even strengthens its general capabilities, showing that domain-specialized training enhances rather than diminishes overall model proficiency. |
| |
|
| | ### Embodied AI Benchmarks |
| |
|
| | #### Affordance & Planning |
| |
|
| | <div align="center"> |
| | <img src="./assets/table2.png" width=800> |
| | </div> |
| |
|
| | #### Spatial Understanding |
| |
|
| | <div align="center"> |
| | <img src="./assets/table3.png" width=800> |
| | </div> |
| |
|
| |
|
| | ### Autonomous Driving Benchmarks |
| |
|
| | #### Single-View Image & Multi-View Video |
| |
|
| | <div align="center"> |
| | <img src="./assets/table4.png" width=800> |
| | </div> |
| |
|
| |
|
| | #### Multi-View Image & Single-View Video |
| |
|
| | <div align="center"> |
| | <img src="./assets/table5.png" width=800> |
| | </div> |
| |
|
| | ### General Visual Understanding Benchmarks |
| |
|
| | <div align="center"> |
| | <img src="./assets/table8.png" width=800> |
| | </div> |
| |
|
| | > Results marked with \* are obtained using our evaluation framework. |
| |
|
| |
|
| | ## V. Case Visualization |
| |
|
| | ### Embodied AI |
| |
|
| | #### Affordance Prediction |
| |
|
| | <div align="center"> |
| | <img src="./assets/afford-1.svg" width=800> |
| | </div> |
| |
|
| | #### Task Planning |
| |
|
| | <div align="center"> |
| | <img src="./assets/planning-1.svg" width=800> |
| | </div> |
| |
|
| | #### Spatial Understanding |
| |
|
| | <div align="center"> |
| | <img src="./assets/spatial-1.svg" width=800> |
| | </div> |
| |
|
| | ### Autonomous Driving |
| |
|
| | #### Environmental Perception |
| |
|
| | <div align="center"> |
| | <img src="./assets/ad-perception-1.svg" width=800> |
| | </div> |
| |
|
| | #### Status Prediction |
| |
|
| | <div align="center"> |
| | <img src="./assets/ad-prediction-1.png" width=800> |
| | </div> |
| |
|
| | #### Driving Planning |
| |
|
| | <div align="center"> |
| | <img src="./assets/ad-planning-1.png" width=800> |
| | </div> |
| |
|
| | ### Real-world Tasks |
| |
|
| | #### Embodied Navigation |
| |
|
| | <div align="center"> |
| | <img src="./assets/figure_navigation.svg" width=800> |
| | </div> |
| |
|
| | #### Embodied Manipulation |
| |
|
| | <div align="center"> |
| | <img src="./assets/figure_manipulation.svg" width=800> |
| | </div> |
| |
|
| |
|
| | ## VI. Citation |
| |
|
| | ```bibtex |
| | @misc{hao2025mimoembodiedxembodiedfoundationmodel, |
| | title={MiMo-Embodied: X-Embodied Foundation Model Technical Report}, |
| | author={Xiaomi Embodied Intelligence Team}, |
| | year={2025}, |
| | eprint={2511.16518}, |
| | archivePrefix={arXiv}, |
| | primaryClass={cs.RO}, |
| | url={https://arxiv.org/abs/2511.16518}, |
| | } |
| | ``` |