Enhance model card with robotics pipeline tag, Transformers library name, sample usage, and enriched description (#1)

Commit: 8ba3ce3600eb60c4044e515de7705964d0bc3925
Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

README.md CHANGED
---
license: mit
pipeline_tag: robotics
library_name: transformers
---

# Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process

This repository contains the UD-VLA checkpoint for the CALVIN ABCD→D benchmark.

Vision-language-action (VLA) models aim to understand natural language instructions and visual observations and to execute the corresponding actions as an embodied agent. UD-VLA addresses limitations of prior VLA models by optimizing generation and action jointly through a synchronous denoising process, in which iterative refinement lets actions evolve from their initialization under constant visual guidance. This approach, grounded in the proposed Joint Discrete Denoising Diffusion Process (JD3P), integrates multiple modalities into a single denoising trajectory, making understanding, generation, and acting intrinsically synergistic.

- **Paper**: [Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process](https://arxiv.org/abs/2511.01718)
- **Project page**: [https://irpn-eai.github.io/UD-VLA.github.io](https://irpn-eai.github.io/UD-VLA.github.io)
- **Code**: [https://github.com/OpenHelix-Team/UD-VLA](https://github.com/OpenHelix-Team/UD-VLA)
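The joint-denoising idea can be pictured as a single masked-token trajectory shared by all modalities. The toy loop below is an illustrative sketch only, not the UD-VLA implementation: `propose` is a hypothetical stand-in for the model's per-step token predictions, and the unmasking schedule is a simple linear one.

```python
import random

MASK = -1  # sentinel for a masked (not-yet-denoised) token

def toy_joint_denoise(seq_len, num_steps, propose, seed=0):
    """Toy joint discrete denoising: all tokens (stand-ins for image and
    action tokens) start masked and share one denoising trajectory,
    with a growing fraction unmasked at each step."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(1, num_steps + 1):
        # Model stand-in: proposes a candidate token for every position.
        candidates = propose(tokens)
        # Linear schedule: after step k, k/num_steps of tokens are unmasked.
        target_unmasked = round(seq_len * step / num_steps)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        already_unmasked = seq_len - len(masked)
        for i in rng.sample(masked, target_unmasked - already_unmasked):
            tokens[i] = candidates[i]
    return tokens
```

With a deterministic `propose`, e.g. `toy_joint_denoise(8, 4, lambda toks: list(range(len(toks))))`, the loop fully unmasks the sequence over the 4 steps; in the real model, iterative refinement lets the action tokens evolve under visual guidance at every step.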

## Installation and Evaluation Example

To get started with UD-VLA and run an evaluation, follow these steps directly from the GitHub repository:

### 1. Install the base environment

```bash
# Create and activate a conda environment
conda create -n udvla-calvin python=3.10 -y
conda activate udvla-calvin

# Clone and install the UD-VLA repo
git clone https://github.com/OpenHelix-Team/UD-VLA.git
cd UD-VLA
pip install -r requirements.txt
```

### 2. Install the CALVIN environment (for evaluation)

This setup is only needed for evaluation. The following steps set up the environment:

```shell
# Install dependencies
cd reference/RoboVLMs

# This will install the required environment and download the CALVIN dataset.
bash scripts/setup_calvin.sh

# Only for the rendering environment.
bash scripts/setup_calvin_vla.sh

# Check that the environment is set up correctly.
python eval/calvin/env_test.py
```

### 3. Run Model Evaluation

You can download the pre-trained checkpoint fine-tuned on CALVIN ABCD→D at [UD-VLA_CALVIN](https://huggingface.co/chenpyyy/UD-VLA_CALVIN-ABCD).

```shell
cd reference/RoboVLMs

# Inference on 4 GPUs; the number of diffusion steps is set to 72.
bash scripts/run_eval_calvin_univla_i2ia_dis.sh

# The command above writes 4 result files to the `results` folder;
# this script computes the final average score.
python tools/evaluation/calvin_score.py
```
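The scoring step merges the four per-GPU result files into one average. As a rough sketch of that aggregation (the JSON layout used here is a hypothetical stand-in, not the repo's actual result format, and `average_success_rates` is not a function from the repo):

```python
import glob
import json

def average_success_rates(result_dir):
    """Element-wise average of per-task success rates across the
    per-GPU result files in `result_dir`.
    NOTE: the {"success_rates": [...]} layout is an assumed,
    illustrative format, not the repo's actual schema."""
    runs = []
    for path in sorted(glob.glob(f"{result_dir}/*.json")):
        with open(path) as f:
            runs.append(json.load(f)["success_rates"])
    n = len(runs)
    # Mean over runs for each consecutive-task horizon.
    return [sum(col) / n for col in zip(*runs)]
```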

## Experiment Results

### Performance on the CALVIN ABCD→D Benchmark

UniVLA\* denotes the variant without historical frames, for a fair comparison. We evaluate 500 rollouts for our model, where each rollout involves a sequence of 5 consecutive sub-tasks.

<div align="center">

<table>
<thead>
<tr>
<th>Method</th><th>Task</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>Avg. Len ↑</th>
</tr>
</thead>
<tbody>
<tr><td>MCIL</td><td>ABCD→D</td><td>0.373</td><td>0.027</td><td>0.002</td><td>0.000</td><td>0.000</td><td>0.40</td></tr>
<tr><td>RT-1</td><td>ABCD→D</td><td>0.844</td><td>0.617</td><td>0.438</td><td>0.323</td><td>0.227</td><td>2.45</td></tr>
<tr><td>Robo-Flamingo</td><td>ABCD→D</td><td>0.964</td><td>0.896</td><td>0.824</td><td>0.740</td><td>0.660</td><td>4.09</td></tr>
<tr><td>GR-1</td><td>ABCD→D</td><td>0.949</td><td>0.896</td><td>0.844</td><td>0.789</td><td>0.731</td><td>4.21</td></tr>
<tr><td>ReconVLA</td><td>ABCD→D</td><td>0.980</td><td>0.900</td><td>0.845</td><td>0.785</td><td>0.705</td><td>4.23</td></tr>
<tr><td>UniVLA*</td><td>ABCD→D</td><td>0.948</td><td>0.906</td><td>0.862</td><td>0.834</td><td>0.690</td><td>4.24</td></tr>
<tr><td>UP-VLA</td><td>ABCD→D</td><td>0.962</td><td>0.921</td><td>0.879</td><td>0.842</td><td>0.812</td><td>4.42</td></tr>
<tr><td><b>UD-VLA (ours)</b></td><td>ABCD→D</td><td><b>0.992</b></td><td><b>0.968</b></td><td><b>0.936</b></td><td><b>0.904</b></td><td><b>0.840</b></td><td><b>4.64</b></td></tr>
</tbody>
</table>

</div>
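Avg. Len is the expected number of consecutively completed sub-tasks, i.e. the sum of the five success-rate columns (this reproduces every row of the table). A quick check against the UD-VLA row:

```python
def avg_len(success_rates):
    """Expected number of consecutively completed sub-tasks:
    the sum of the success rates at horizons 1..5."""
    return round(sum(success_rates), 2)

# UD-VLA row of the table: 0.992 + 0.968 + 0.936 + 0.904 + 0.840
print(avg_len([0.992, 0.968, 0.936, 0.904, 0.840]))  # -> 4.64
```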

### Performance in the Real World

Our real-world setup consists of a 6-DoF UR5e robotic arm equipped with a 6-DoF Inspire RH56E2 robotic hand for dexterous manipulation.



## Other Simulation Benchmark Setup

- [LIBERO](docs/libero.md)
- [SimplerEnv](docs/simpler.md)

## ❤️ Acknowledgment

We thank [UniVLA](https://github.com/baaivision/UniVLA), [Emu3](https://github.com/baaivision/Emu3), [RobotVLM](https://github.com/Robot-VLAs/RoboVLMs), and [Show-o](https://github.com/showlab/Show-o) for their open-sourced work!

We thank [Yuqi Wang](https://github.com/Robertwyq) and [Zhide Zhong](https://scholar.google.com/citations?user=msy4tL4AAAAJ&hl=zh-CN) for their guidance on the experiments!

## 📖 Citation

If you find UD-VLA useful, please consider citing our work 🤗:

```bibtex
@article{udvla2025,
  title={Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process},
  author={Jiayi Chen and Wenxuan Song and Pengxiang Ding and Ziyang Zhou and Han Zhao and Feilong Tang and Donglin Wang and Haoang Li},
  year={2025},
  journal={arXiv preprint arXiv:2511.01718}
}
```