Add comprehensive model card for VITA-E
#1 · by nielsr (HF Staff) · opened

README.md (added)
---
license: apache-2.0
pipeline_tag: robotics
library_name: transformers
---

# VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting

<div align="center">
🌐 <a href="https://lxysl.github.io/VITA-E/">Project Page</a> · 📄 <a href="https://huggingface.co/papers/2510.21817">Paper</a> · 💻 <a href="https://github.com/Tencent/VITA/tree/VITA-E">Code on GitHub</a> · 📺 <a href="https://youtu.be/jplQ0R50kfU">Live Demo</a>
</div>

<p align="center">
  <a href="#overview"><b>🗺️ Overview</b></a> |
  <a href="#experimental-results"><b>📊 Experimental Results</b></a> |
  <a href="#get-started"><b>⚡ Get Started</b></a> |
  <a href="#inference-demo"><b>💻 Inference & Demo</b></a> |
  <a href="#training"><b>🔥 Training</b></a>
</p>

<p align="center">
<img src="https://github.com/Tencent/VITA/raw/VITA-E/asset/vita-e-demo.png" width="70%" height="70%"><br>
VITA-E handles a range of complex interactive scenarios, including concurrent execution and near-real-time interruption.<br>
<a href="https://youtu.be/jplQ0R50kfU">📺 VITA-E Demo Show! Here We Go! 🔥</a><br>
</p>

<a id="overview"></a>
## 🗺️ VITA-E Overview

<table>
<tr>
<td width="320">
<img src="https://github.com/Tencent/VITA/raw/VITA-E/asset/vita-e-logo.png" alt="VITA-E Logo" width="300">
</td>
<td>

We are excited to present **VITA-E**, which incorporates a series of advancements:

1. **Dual-Model Framework for Seamless Interaction.** VITA-E introduces a dual-model core in which an "Active Model" executes the current task while a "Listening Model" stands ready for new commands.

2. **"Model-as-Controller" Paradigm.** We pioneer a "model-as-controller" approach: the Vision-Language Model is fine-tuned to generate special tokens that act as direct system-level commands, enabling precise, reliable, and immediate control over system actions.

3. **Smooth Human-Computer Interaction.** Through this mechanism, VITA-E supports smooth two-way voice interaction: it can reply while executing, accept voice interruptions during actions, and transition naturally between actions. It also supports both English and Chinese.

4. **Strong Performance in Critical Interactive Scenarios.** Tested on a physical humanoid robot, VITA-E demonstrates high reliability and responsiveness, achieving a high success rate across multiple interactive and operational tasks, and is compatible with a wide range of mainstream VLA models.

</td>
</tr>
</table>
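
The "model-as-controller" idea above can be sketched as a token dispatcher: control tokens in the VLM's output stream are routed to the system instead of being spoken. This is a minimal illustration under assumed names, not VITA-E's implementation; the token strings (`<|stop|>`, etc.) and the `Controller` class are hypothetical.

```python
# Hypothetical sketch of "model-as-controller": the VLM's token stream is
# scanned for special control tokens, which are dispatched as system-level
# commands, while ordinary tokens pass through as the spoken reply.
# Token names and the Controller API are illustrative, not VITA-E's.

CONTROL_TOKENS = {"<|stop|>", "<|interrupt|>", "<|switch_task|>"}

class Controller:
    def __init__(self):
        self.state = "executing"  # the "Active Model" is running a task
        self.log = []

    def dispatch(self, token):
        # Map a control token emitted by the Listening Model to a system action.
        if token == "<|stop|>":
            self.state = "idle"        # emergency stop
        elif token == "<|interrupt|>":
            self.state = "listening"   # pause the action, take the new command
        elif token == "<|switch_task|>":
            self.state = "executing"   # hand the new task to the Active Model
        self.log.append((token, self.state))

def run(controller, generated_tokens):
    """Route control tokens to the controller; return the spoken reply."""
    reply = []
    for tok in generated_tokens:
        if tok in CONTROL_TOKENS:
            controller.dispatch(tok)
        else:
            reply.append(tok)
    return " ".join(reply)

ctrl = Controller()
reply = run(ctrl, ["Sure,", "picking", "it", "up.", "<|interrupt|>", "Okay,", "stopping."])
```

Because the tokens are generated by the fine-tuned VLM itself, control decisions need no separate wake-word or rule engine; the model's output *is* the control signal.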

<a id="experimental-results"></a>
## 📊 Experimental Results

- **Success rate comparison of VITA-E and baseline models on two fundamental manipulation tasks.**

<p align="center">
<img src="https://github.com/Tencent/VITA/raw/VITA-E/asset/vita-e-results.png" width="80%" height="80%">
</p>

- **Key interactive performance.**

<div align="center">

<table>
<thead>
<tr>
<th>Speech Interruption</th>
<th>Task Switching</th>
<th>Emergency Stop</th>
<th>Avg. Voice Response Latency</th>
</tr>
</thead>
<tbody>
<tr>
<td>100%</td>
<td>93.3%</td>
<td>100%</td>
<td>2.26 s</td>
</tr>
</tbody>
</table>

</div>

<a id="get-started"></a>
## ⚡ Get Started

Set up the conda environment:

```bash
git clone https://github.com/VITA-MLLM/VITA-E
cd VITA-E
conda create -n vita_e python=3.10 -y
conda activate vita_e
pip install --upgrade pip
pip install -r vita_e_requirements.txt
pip install flash-attn --no-build-isolation
```

Download the required model weights to a local path: [VITA-E](https://huggingface.co/VITA-MLLM/VITA-E).

```bash
huggingface-cli download VITA-MLLM/VITA-E --local-dir checkpoints/VITA-E
```

<a id="inference-demo"></a>
## 💻 Inference & Demo

### 🚀 Inference

Run the inference script:

```bash
python inference_vita_e.py \
    --model_path_vlm checkpoints/VITA-E/vita_vla_finetune \
    --model_path_policy checkpoints/VITA-E/vita_gr00t_robot
```

### 🌐 Demo

#### Web Demo

You can interact with the VITA-E web demo using mocked robot state data, so no physical robot is required. (A total of 48 GB of GPU memory is needed.)

Prepare a VAD (Voice Activity Detection) module: download [silero_vad.onnx](https://github.com/snakers4/silero-vad/tree/v4.0/files) and [silero_vad.jit](https://github.com/snakers4/silero-vad/tree/v4.0/files), and place the files in the `./demo/wakeup_and_vad/resource/` directory.

```bash
python -m vita_e.server_vla_vita \
    --model_path_vlm checkpoints/VITA-E/vita_vla_finetune \
    --model_path_policy checkpoints/VITA-E/vita_gr00t_robot \
    --ip 0.0.0.0 \
    --port 8081
```

Wait about three minutes for all modules to load, then open `127.0.0.1:8081` in a browser on your server.

#### Real Robot Demo

Deploy the server script on your server:

```bash
python -m vita_e.server_vla_vita \
    --model_path_vlm checkpoints/VITA-E/vita_vla_finetune \
    --model_path_policy checkpoints/VITA-E/vita_gr00t_robot \
    --ip 0.0.0.0 \
    --port 8081
```

Start the client script on the robot client:

```bash
cd demo
python vla_robot_client.py
```

<a id="training"></a>
## 🔥 Training

Our VITA-E model is built upon the VITA-1.5 and Isaac-GR00T architectures. We use VITA-1.5 as the VLM component and integrate Isaac-GR00T's pre-trained diffusion action expert as the action model.

The training process involves two stages: first, we fine-tune the VLM component and integrate it into the Isaac-GR00T framework by replacing the original VLM; then, we perform end-to-end fine-tuning of the complete model on VLA data.

Please refer to [VITA-1.5](https://github.com/VITA-MLLM/VITA) and [Isaac-GR00T](https://github.com/NVIDIA/Isaac-GR00T) for more details.
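
The two-stage schedule can be outlined as follows. This is a sketch of the control flow only, under assumed module names, not VITA-E's training code:

```python
# Hypothetical outline of the two-stage schedule described above.
# Stage 1: fine-tune the VLM alone (it then replaces GR00T's original VLM).
# Stage 2: fine-tune the full model (VLM + diffusion action expert)
# end-to-end on VLA data. Module names here are illustrative.

def trainable_modules(stage):
    """Return which modules receive gradient updates in each stage."""
    if stage == 1:
        return {"vlm": True, "action_expert": False}  # VLM-only fine-tuning
    if stage == 2:
        return {"vlm": True, "action_expert": True}   # end-to-end fine-tuning
    raise ValueError(f"unknown stage: {stage}")

schedule = [trainable_modules(s) for s in (1, 2)]
```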

## ✒️ Citation

If you find our work helpful for your research, please consider citing it:

```bibtex
@article{liu2025vitae,
  title={VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting},
  author={Liu, Xiaoyu and Fu, Chaoyou and Yan, Chi and Wu, Chu and Gao, Haihan and Zhang, Yi-Fan and Dong, Shaoqi and Qian, Cheng and Luo, Bin and Yang, Xiuyong and Li, Guanwu and Cai, Yusheng and Shen, Yunhang and Jiang, Deqiang and Cao, Haoyu and Sun, Xing and Shan, Caifeng and He, Ran},
  journal={arXiv preprint arXiv:2510.21817},
  year={2025}
}
```

## 📖 More Research

Explore our related research:
- **[VITA-1.5]** [VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction](https://github.com/VITA-MLLM/VITA)
- **[VITA-1.0]** [VITA: Towards Open-Source Interactive Omni Multimodal LLM](https://vita-home.github.io/)
- **[Awesome-MLLM]** [A Survey on Multimodal Large Language Models](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models)
- **[MME]** [MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation)
- **[Video-MME]** [Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis](https://github.com/BradyFU/Video-MME)

## 👍 Acknowledgments

VITA-E is built with reference to the following outstanding works: [Isaac-GR00T](https://github.com/NVIDIA/Isaac-GR00T) and [LeRobot](https://github.com/huggingface/lerobot). Thanks!