---
license: mit
pipeline_tag: robotics
library_name: transformers
---

# Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process

This repository contains the UD-VLA checkpoint for the CALVIN ABCD→D benchmark.

Vision-language-action (VLA) models aim to understand natural language instructions and visual observations and to execute the corresponding actions as an embodied agent. UD-VLA addresses limitations of prior VLA models by optimizing generation and action jointly through a synchronous denoising process, in which iterative refinement lets actions evolve from their initialization under constant visual guidance. This approach, grounded in the proposed Joint Discrete Denoising Diffusion Process (JD3P), integrates multiple modalities into a single denoising trajectory, making understanding, generation, and acting intrinsically synergistic.

- **Paper**: [Unified Diffusion VLA: Vision-Language-Action Model via Joint Discrete Denoising Diffusion Process](https://arxiv.org/abs/2511.01718)
- **Project page**: [https://irpn-eai.github.io/UD-VLA.github.io](https://irpn-eai.github.io/UD-VLA.github.io)
- **Code**: [https://github.com/OpenHelix-Team/UD-VLA](https://github.com/OpenHelix-Team/UD-VLA)

## Installation and Evaluation Example

To get started with UD-VLA and run an evaluation, follow these steps from the GitHub repository:

### 1. Install the base environment

```bash
# Create and activate conda environment
conda create -n udvla-calvin python=3.10 -y
conda activate udvla-calvin

# Clone and install the UD-VLA repo
git clone https://github.com/OpenHelix-Team/UD-VLA.git
cd UD-VLA
pip install -r requirements.txt
```

### 2. Install the CALVIN environment (for evaluation)

This setup is only needed for evaluation:

```shell
# Install dependencies
cd reference/RoboVLMs

# This installs the required environment and downloads the CALVIN dataset.
bash scripts/setup_calvin.sh
# Only for the rendering environment.
bash scripts/setup_calvin_vla.sh

# Check that the environment is set up correctly
python eval/calvin/env_test.py
```

### 3. Run model evaluation

You can download the pre-trained checkpoint fine-tuned on CALVIN ABCD→D at [UD-VLA_CALVIN](https://huggingface.co/chenpyyy/UD-VLA_CALVIN-ABCD).

```shell
cd reference/RoboVLMs

# Inference on 4 GPUs; the diffusion step count is set to 72.
bash scripts/run_eval_calvin_univla_i2ia_dis.sh

# The command above writes 4 result files to the `results` folder;
# compute the final average score with:
python tools/evaluation/calvin_score.py
```

## Experiment Result

### Performance on the CALVIN ABCD→D Benchmark

UniVLA\* denotes the variant without historical frames, for a fair comparison. We evaluate 500 rollouts of our model, where each rollout involves a sequence of 5 consecutive sub-tasks.
| Method | Split | 1 | 2 | 3 | 4 | 5 | Avg. Len ↑ |
|---|---|---|---|---|---|---|---|
| MCIL | ABCD→D | 0.373 | 0.027 | 0.002 | 0.000 | 0.000 | 0.40 |
| RT-1 | ABCD→D | 0.844 | 0.617 | 0.438 | 0.323 | 0.227 | 2.45 |
| Robo-Flamingo | ABCD→D | 0.964 | 0.896 | 0.824 | 0.740 | 0.660 | 4.09 |
| GR-1 | ABCD→D | 0.949 | 0.896 | 0.844 | 0.789 | 0.731 | 4.21 |
| ReconVLA | ABCD→D | 0.980 | 0.900 | 0.845 | 0.785 | 0.705 | 4.23 |
| UniVLA* | ABCD→D | 0.948 | 0.906 | 0.862 | 0.834 | 0.690 | 4.24 |
| UP-VLA | ABCD→D | 0.962 | 0.921 | 0.879 | 0.842 | 0.812 | 4.42 |
| UD-VLA (ours) | ABCD→D | 0.992 | 0.968 | 0.936 | 0.904 | 0.840 | 4.64 |
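Columns 1–5 report the success rate of completing that many sub-tasks in a row, and Avg. Len is the expected number of consecutively completed sub-tasks, i.e. the sum of those five rates. A minimal sketch checking this against two rows of the table (the `avg_len` helper is ours, for illustration only; the published rates are rounded, so sums match Avg. Len to within about ±0.01):

```python
def avg_len(success_rates):
    """Expected number of consecutively completed sub-tasks:
    the sum of the success rates at chain lengths 1..5."""
    return sum(success_rates)

# Success rates at chain lengths 1..5, taken from the table above.
ud_vla = [0.992, 0.968, 0.936, 0.904, 0.840]
rt_1 = [0.844, 0.617, 0.438, 0.323, 0.227]

print(round(avg_len(ud_vla), 2))  # 4.64, matching the UD-VLA row
print(round(avg_len(rt_1), 2))    # 2.45, matching the RT-1 row
```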