---
license: apache-2.0
---
<p align="center">
<img src="https://github.com/alibaba-damo-academy/RynnEC/blob/main/assets/logo.jpg?raw=true" width="150" style="margin-bottom: 0.2;"/>
</p>
<h3 align="center"><a href="https://github.com/alibaba-damo-academy/RynnEC" style="color:#9C276A">
RynnEC: Bringing MLLMs into Embodied World</a></h3>
<h5 align="center"> If our project helps you, please give us a star ⭐ on <a href="https://github.com/alibaba-damo-academy/RynnEC">GitHub</a> to support us. </h5>
## 📰 News
* **[2025.08.17]** 🤗 The RynnEC-7B model checkpoint has been released on Hugging Face.
* **[2025.08.08]** 🔥🔥 Released our RynnEC-2B model, RynnEC-Bench, and the training code.
## Introduction
**RynnEC** is a video multi-modal large language model (MLLM) specifically designed for embodied cognition
tasks.
<p align="center">
<img src="https://github.com/alibaba-damo-academy/RynnEC/blob/main/assets/radar.png?raw=true" width="100%" style="margin-bottom: 0.2;"/>
</p>
## Architecture
**RynnEC** accepts a variety of input types, including images, videos, visual prompts, and task instructions. Visual inputs are processed by a vision encoder with an any-resolution strategy, while visual prompts are handled by a region encoder that extracts fine-grained features. Textual inputs are converted into a unified token stream through tokenization. For video segmentation tasks, a mask decoder transforms the output segmentation embeddings into binary masks.
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/67fcc97cede5c434e0cc37e3/FEdKco-A0nitu4drJZTDk.png" width="100%" style="margin-bottom: 0.2;"/>
</p>
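The input-fusion flow described above can be sketched in plain Python. All function names below are illustrative stubs, not the RynnEC API; the real encoders and decoder live in the RynnEC repository.

```python
# Illustrative sketch of the RynnEC data flow: visual inputs, visual
# prompts (region masks), and text are each encoded, then merged into
# one unified token stream that the language model consumes.

def vision_encode(frames):
    """Any-resolution vision encoder: one feature per video frame (stubbed)."""
    return [f"vis({f})" for f in frames]

def region_encode(masks):
    """Region encoder: fine-grained features for visual prompts (stubbed)."""
    return [f"reg({m})" for m in masks]

def tokenize(instruction):
    """Text tokenizer (stubbed as whitespace split)."""
    return instruction.split()

def build_token_stream(frames, masks, instruction):
    """Merge visual features, region features, and text tokens into one stream."""
    return vision_encode(frames) + region_encode(masks) + tokenize(instruction)

stream = build_token_stream(["frame0", "frame1"], ["mask0"], "where is the cup?")
# For segmentation tasks, the model's output segmentation embeddings would
# then be passed through a mask decoder to produce binary masks.
```

The key design point is that all three modalities end up in a single token sequence, so the language model can attend across them uniformly.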
## Model Zoo
| Model | Base Model | HF Link |
| -------------------- | ------------ | ------------------------------------------------------------ |
| RynnEC-2B | Qwen2.5-1.5B-Instruct | [Alibaba-DAMO-Academy/RynnEC-2B](https://huggingface.co/Alibaba-DAMO-Academy/RynnEC-2B) |
| RynnEC-7B | Qwen2.5-7B-Instruct | [Alibaba-DAMO-Academy/RynnEC-7B](https://huggingface.co/Alibaba-DAMO-Academy/RynnEC-7B) |
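The repo ids in the table above can be fetched from the Hugging Face Hub. A minimal sketch follows; the `rynnec_repo_id` helper is a hypothetical convenience, not part of any official RynnEC tooling, and the download step assumes `huggingface_hub` is installed.

```python
# Repo ids taken from the Model Zoo table above.
RYNNEC_CHECKPOINTS = {
    "2B": "Alibaba-DAMO-Academy/RynnEC-2B",
    "7B": "Alibaba-DAMO-Academy/RynnEC-7B",
}

def rynnec_repo_id(size: str) -> str:
    """Map a model size ("2B" or "7B") to its Hugging Face repo id."""
    try:
        return RYNNEC_CHECKPOINTS[size]
    except KeyError:
        raise ValueError(
            f"unknown RynnEC size {size!r}; choose from {sorted(RYNNEC_CHECKPOINTS)}"
        )

# Downloading a checkpoint (requires network access and huggingface_hub):
# from huggingface_hub import snapshot_download
# local_dir = snapshot_download(repo_id=rynnec_repo_id("2B"))
```

See the [RynnEC GitHub repository](https://github.com/alibaba-damo-academy/RynnEC) for the actual inference and training entry points.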
## Main Results
Benchmark comparison across object cognition and spatial cognition. With a highly efficient **2B**-parameter architecture, **RynnEC-2B** achieves state-of-the-art (SOTA) performance on complex spatial cognition tasks.
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/67fcc97cede5c434e0cc37e3/XXmvypGmuiY9MJ6eYh9LL.png" width="100%" style="margin-bottom: 0.2;"/>
</p>
## Citation
If you find RynnEC useful for your research and applications, please cite using this BibTeX:
```bibtex
@misc{dang2025rynnecbringingmllmsembodied,
      title={RynnEC: Bringing MLLMs into Embodied World},
      author={Ronghao Dang and Yuqian Yuan and Yunxuan Mao and Kehan Li and Jiangpin Liu and Zhikai Wang and Xin Li and Fan Wang and Deli Zhao},
      year={2025},
      eprint={2508.14160},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.14160},
}
```