---
license: apache-2.0
---
<p align="center">
    <img src="https://github.com/alibaba-damo-academy/RynnEC/blob/main/assets/logo.jpg?raw=true" width="150" style="margin-bottom: 0.2;"/>
</p>

<h3 align="center"><a href="" style="color:#9C276A">
RynnEC: Bringing MLLMs into Embodied World</a></h3>
<h5 align="center"> If our project helps you, please give us a star ⭐ on <a href="https://github.com/alibaba-damo-academy/RynnEC">GitHub</a> to support us. 🙏🙏 </h5>


## 📰 News
* **[2025.08.17]**  🤗 The RynnEC-7B model checkpoint has been released on Hugging Face.
* **[2025.08.08]**  🔥🔥 Released our RynnEC-2B model, RynnEC-Bench, and the training code.



## 🌟 Introduction
**RynnEC** is a video multi-modal large language model (MLLM) specifically designed for embodied cognition tasks.

<p align="center">
    <img src="https://github.com/alibaba-damo-academy/RynnEC/blob/main/assets/radar.png?raw=true" width="100%" style="margin-bottom: 0.2;"/>
</p>

## ๐Ÿ“Architecture
**RynnEC** accepts a variety of input types, including images, videos, visual prompts, and task instructions. Visual inputs are processed by a Vision Encoder with an any-resolution strategy, while visual prompts are handled by a region encoder that extracts fine-grained object features. Textual inputs are converted into the same unified token stream through tokenization. For video segmentation tasks, a mask decoder transforms the output segmentation embeddings into binary masks.

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/67fcc97cede5c434e0cc37e3/FEdKco-A0nitu4drJZTDk.png" width="100%" style="margin-bottom: 0.2;"/>
</p>
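
To make the flow concrete, here is a toy, runnable sketch of how these four components hand data to one another. Every name in it (`ToyRynnEC`, `vision_encoder`, `region_encoder`, `mask_decoder`, ...) is an illustrative placeholder, not the actual RynnEC API; the official repository is the authoritative reference.

```python
# Toy sketch of the pipeline described above. All names are hypothetical
# placeholders, not the real RynnEC interfaces.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Output:
    text: str
    segmentation_embeddings: List[float] = field(default_factory=list)


class ToyRynnEC:
    def vision_encoder(self, frames: List[str]) -> List[str]:
        # Any-resolution strategy: each frame yields visual tokens.
        return [f"<vis:{f}>" for f in frames]

    def region_encoder(self, prompts: List[str]) -> List[str]:
        # Visual prompts (e.g. boxes/masks) yield fine-grained region tokens.
        return [f"<region:{p}>" for p in prompts]

    def tokenize(self, instruction: str) -> List[str]:
        # Text joins the same unified token stream.
        return instruction.split()

    def llm(self, tokens: List[str]) -> Output:
        # Stand-in LLM: answers in text and, for segmentation requests,
        # also emits segmentation embeddings for the mask decoder.
        needs_mask = "segment" in tokens
        return Output(text="<answer>",
                      segmentation_embeddings=[0.1] if needs_mask else [])

    def mask_decoder(self, embeddings: List[float]) -> List[List[int]]:
        # Segmentation embeddings -> binary masks (a tiny 2x2 mask here).
        return [[1, 0], [0, 1]]


model = ToyRynnEC()
tokens = (
    model.vision_encoder(["frame0", "frame1"])
    + model.region_encoder(["box(12,40,88,120)"])
    + model.tokenize("segment the mug on the table")
)
out = model.llm(tokens)
masks = (model.mask_decoder(out.segmentation_embeddings)
         if out.segmentation_embeddings else None)
print(out.text, masks)
```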
  
## 🌎 Model Zoo

| Model                | Base Model   | HF Link                                                      |
| -------------------- | ------------ | ------------------------------------------------------------ |
| RynnEC-2B       | Qwen2.5-1.5B-Instruct   | [Alibaba-DAMO-Academy/RynnEC-2B](https://huggingface.co/Alibaba-DAMO-Academy/RynnEC-2B) |
| RynnEC-7B       | Qwen2.5-7B-Instruct   | [Alibaba-DAMO-Academy/RynnEC-7B](https://huggingface.co/Alibaba-DAMO-Academy/RynnEC-7B) |
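
The snippet below is a minimal loading sketch, assuming the checkpoints ship custom modeling code usable through transformers' `trust_remote_code` path; this is not the documented RynnEC entry point, so consult the GitHub repository for the officially supported usage.

```python
# Hypothetical loading sketch (assumes the checkpoint works with
# transformers' trust_remote_code path; see the GitHub repo for the
# officially supported entry point).
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "Alibaba-DAMO-Academy/RynnEC-2B"  # or "Alibaba-DAMO-Academy/RynnEC-7B"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",  # requires accelerate; drop for CPU-only loading
)
```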



## 📊 Main Results

Benchmark comparison across object cognition and spatial cognition. With a highly efficient **2B**-parameter architecture, **RynnEC-2B** achieves state-of-the-art (SOTA) performance on complex spatial cognition tasks.

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/67fcc97cede5c434e0cc37e3/XXmvypGmuiY9MJ6eYh9LL.png" width="100%" style="margin-bottom: 0.2;"/>
</p>
  

## 📑 Citation

If you find RynnEC useful for your research and applications, please cite using this BibTeX:
```bibtex
@misc{dang2025rynnecbringingmllmsembodied,
      title={RynnEC: Bringing MLLMs into Embodied World}, 
      author={Ronghao Dang and Yuqian Yuan and Yunxuan Mao and Kehan Li and Jiangpin Liu and Zhikai Wang and Xin Li and Fan Wang and Deli Zhao},
      year={2025},
      eprint={2508.14160},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.14160}, 
}
```