---
license: apache-2.0
base_model:
- facebook/chameleon-7b
tags:
- VLA
- Robotics
---
<p align="center">
    <img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/logo.png?raw=true" width="80" style="margin-bottom: 0.1;"/>
</p>
<h3 align="center"><a href="https://github.com/alibaba-damo-academy/WorldVLA/tree/main" style="color:#9C276A">
WorldVLA: Towards Autoregressive Action World Model</a></h3>
<h5 align="center"> If our project helps you, please give us a star ⭐ on GitHub to support us. πŸ™πŸ™ </h5>
<h5 align="center">
[![arXiv](https://img.shields.io/badge/Arxiv-2506.21539-AD1C18.svg?logo=arXiv)](https://arxiv.org/pdf/2506.21539)
[![GitHub](https://img.shields.io/badge/GitHub-WorldVLA-9cf?logo=github)](https://github.com/alibaba-damo-academy/WorldVLA)
[![hf_checkpoint](https://img.shields.io/badge/πŸ€—-Checkpoints-9C276A.svg)](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA)
[![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/LICENSE)
</h5>
## 🌟 Introduction
WorldVLA is an autoregressive action world model that unifies action and image understanding and generation. It integrates a Vision-Language-Action (VLA) model (the action model) and a world model in a single framework.
<div style="text-align: center;">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/overview.png" style="max-width: 100%; height: auto; display: block; margin: 0 auto;">
</div>
<br>
### Action Model Results (Text + Image -> Action)
The action model generates actions given a text instruction and image observations.
<table>
<tr>
<td width="300">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/action_model_open_the_middle_drawer_of_the_cabinet.gif" width="100%">
</td>
<td width="300">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/action_model_pick_up_the_alphabet_soup_and_place_it_in_the_bask.gif" width="100%">
</td>
<td width="300">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/action_model_pick_up_the_black_bowl_between_the_plate_and_the_r.gif" width="100%">
</td>
</tr>
<tr>
<td><center>Input: Open the middle drawer of the cabinet.</center></td>
<td><center>Input: Pick up the alphabet soup and place it in the basket.</center></td>
<td><center>Input: Pick up the black bowl between the plate and the ramekin and place it on the plate.</center></td>
</tr>
</table>
### World Model Results (Action + Image -> Image)
The world model generates the next frame given the current frame and an action as the control input.
<table>
<tr>
<td width="300" align="center">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/world_model_open_the_top_drawer_and_put_the_bowl_inside.gif" width="100%">
</td>
<td width="300" align="center">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/world_model_push_the_plate_to_the_front_of_the_stove.gif" width="100%">
</td>
<td width="300" align="center">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/world_model_put_the_bowl_on_the_stove.gif" width="100%">
</td>
</tr>
<tr>
<td align="center">
Input: Action sequence of "Open the top drawer and put the bowl inside".
</td>
<td align="center">
Input: Action sequence of "Push the plate to the front of the stove".
</td>
<td align="center">
Input: Action sequence of "Put the bowl on the stove".
</td>
</tr>
</table>
## Model Zoo
| Model (256 × 256) | HF Link | Success Rate (%) |
| :--------------------: | :------------------------------------------------------------: | :--------------------: |
| LIBERO-Spatial | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_spatial](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_spatial) | 85.6 |
| LIBERO-Object | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_object](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_object) | 89.0 |
| LIBERO-Goal | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_goal](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_goal) | 82.6 |
| LIBERO-Long | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_10](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_10) | 59.0 |
<br>
| Model (512 × 512) | HF Link | Success Rate (%) |
| :--------------------: | :------------------------------------------------------------: | :--------------------: |
| LIBERO-Spatial | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_spatial](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_spatial) | 87.6 |
| LIBERO-Object | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_object](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_object) | 96.2 |
| LIBERO-Goal | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_goal](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_goal) | 83.4 |
| LIBERO-Long | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_10](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_10) | 60.0 |
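The checkpoint layout in the tables above follows a regular pattern, with one quirk: the LIBERO-Long checkpoints live under `libero_10`, not `libero_long`. The helper below is an illustrative sketch (not part of the official codebase) that builds the repo subfolder path for a given task suite and resolution; the commented `snapshot_download` call shows one way to fetch a single checkpoint folder, assuming `huggingface_hub` is installed.

```python
# Map each LIBERO task suite from the Model Zoo tables to its checkpoint
# subfolder in the Alibaba-DAMO-Academy/WorldVLA repository.
REPO_ID = "Alibaba-DAMO-Academy/WorldVLA"

SUITE_TO_SUBFOLDER = {
    "LIBERO-Spatial": "libero_spatial",
    "LIBERO-Object": "libero_object",
    "LIBERO-Goal": "libero_goal",
    "LIBERO-Long": "libero_10",  # folder name differs from the suite name
}

def checkpoint_subfolder(suite: str, resolution: int = 256) -> str:
    """Return the repo subfolder for a task suite at 256x256 or 512x512."""
    if resolution not in (256, 512):
        raise ValueError("WorldVLA checkpoints are released at 256 or 512 only")
    return f"model_{resolution}/{SUITE_TO_SUBFOLDER[suite]}"

# To download just that folder (sketch; requires `pip install huggingface_hub`):
#   from huggingface_hub import snapshot_download
#   local_dir = snapshot_download(
#       REPO_ID,
#       allow_patterns=[checkpoint_subfolder("LIBERO-Goal", 512) + "/*"],
#   )

print(checkpoint_subfolder("LIBERO-Long", 512))  # -> model_512/libero_10
```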
## Citation <a name="citation"></a>
If you find the project helpful for your research, please consider citing our paper:
```bibtex
@article{cen2025worldvla,
title={WorldVLA: Towards Autoregressive Action World Model},
author={Cen, Jun and Yu, Chaohui and Yuan, Hangjie and Jiang, Yuming and Huang, Siteng and Guo, Jiayan and Li, Xin and Song, Yibing and Luo, Hao and Wang, Fan and others},
journal={arXiv preprint arXiv:2506.21539},
year={2025}
}
```
## Acknowledgment <a name="acknowledgment"></a>
This project builds upon [Lumina-mGPT](https://github.com/Alpha-VLLM/Lumina-mGPT), [Chameleon](https://github.com/facebookresearch/chameleon), and [OpenVLA](https://github.com/openvla/openvla). We thank these teams for their open-source contributions.