---
license: apache-2.0
base_model:
- facebook/chameleon-7b
tags:
- VLA
- Robotics
---
<p align="center">
    <img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/logo.png?raw=true" width="80" style="margin-bottom: 0.1;"/>
</p>
<h3 align="center"><a href="https://github.com/alibaba-damo-academy/WorldVLA/tree/main" style="color:#9C276A">
WorldVLA: Towards Autoregressive Action World Model</a></h3>
<h5 align="center"> If our project helps you, please give us a star ⭐ on GitHub to support us. πŸ™πŸ™ </h5>
<h5 align="center">
[![arXiv](https://img.shields.io/badge/Arxiv-2506.21539-AD1C18.svg?logo=arXiv)](https://arxiv.org/pdf/2506.21539)
[![GitHub](https://img.shields.io/badge/GitHub-WorldVLA-9cf?logo=github)](https://github.com/alibaba-damo-academy/WorldVLA)
[![hf_checkpoint](https://img.shields.io/badge/πŸ€—-Checkpoints-9C276A.svg)](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA)
[![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/LICENSE)
</h5>
## 🌟 Introduction
WorldVLA is an autoregressive action world model that unifies action and image understanding and generation. It integrates a Vision-Language-Action (VLA) model (the action model) and a world model in a single framework.
<div style="text-align: center;">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/overview.png" style="max-width: 100%; height: auto; display: block; margin: 0 auto;">
</div>
<br>
### Action Model Results (Text + Image -> Action)
The action model generates actions given a text instruction and image observations.
<table>
<tr>
<td width="300">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/action_model_open_the_middle_drawer_of_the_cabinet.gif" width="100%">
</td>
<td width="300">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/action_model_pick_up_the_alphabet_soup_and_place_it_in_the_bask.gif" width="100%">
</td>
<td width="300">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/action_model_pick_up_the_black_bowl_between_the_plate_and_the_r.gif" width="100%">
</td>
</tr>
<tr>
<td><center>Input: Open the middle drawer of the cabinet.</center></td>
<td><center>Input: Pick up the alphabet soup and place it in the basket.</center></td>
<td><center>Input: Pick up the black bowl between the plate and the ramekin and place it on the plate.</center></td>
</tr>
</table>
### World Model Results (Action + Image -> Image)
The world model generates the next frame given the current frame and an action as the control input.
<table>
<tr>
<td width="300" align="center">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/world_model_open_the_top_drawer_and_put_the_bowl_inside.gif" width="100%">
</td>
<td width="300" align="center">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/world_model_push_the_plate_to_the_front_of_the_stove.gif" width="100%">
</td>
<td width="300" align="center">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/world_model_put_the_bowl_on_the_stove.gif" width="100%">
</td>
</tr>
<tr>
<td align="center">
Input: Action sequence of "Open the top drawer and put the bowl inside".
</td>
<td align="center">
Input: Action sequence of "Push the plate to the front of the stove".
</td>
<td align="center">
Input: Action sequence of "Put the bowl on the stove".
</td>
</tr>
</table>
## Model Zoo
| Model (256 × 256) | HF Link | Success Rate (%) |
| :--------------------: | :------------------------------------------------------------: | :--------------------: |
| LIBERO-Spatial | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_spatial](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_spatial) | 85.6 |
| LIBERO-Object | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_object](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_object) | 89.0 |
| LIBERO-Goal | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_goal](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_goal) | 82.6 |
| LIBERO-Long | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_10](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_10) | 59.0 |
<br>
| Model (512 × 512) | HF Link | Success Rate (%) |
| :--------------------: | :------------------------------------------------------------: | :--------------------: |
| LIBERO-Spatial | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_spatial](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_spatial) | 87.6 |
| LIBERO-Object | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_object](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_object) | 96.2 |
| LIBERO-Goal | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_goal](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_goal) | 83.4 |
| LIBERO-Long | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_10](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_10) | 60.0 |
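The checkpoint layout in the tables above follows a regular pattern, with one quirk: the LIBERO-Long checkpoints live under `libero_10`, not `libero_long`. The helper below is an illustrative sketch (not part of the official codebase) that builds the repo subfolder path for a given task suite and resolution; the commented `snapshot_download` call shows one way to fetch a single checkpoint folder, assuming `huggingface_hub` is installed.

```python
# Map each LIBERO task suite from the Model Zoo tables to its checkpoint
# subfolder in the Alibaba-DAMO-Academy/WorldVLA repository.
REPO_ID = "Alibaba-DAMO-Academy/WorldVLA"

SUITE_TO_SUBFOLDER = {
    "LIBERO-Spatial": "libero_spatial",
    "LIBERO-Object": "libero_object",
    "LIBERO-Goal": "libero_goal",
    "LIBERO-Long": "libero_10",  # folder name differs from the suite name
}

def checkpoint_subfolder(suite: str, resolution: int = 256) -> str:
    """Return the repo subfolder for a task suite at 256x256 or 512x512."""
    if resolution not in (256, 512):
        raise ValueError("WorldVLA checkpoints are released at 256 or 512 only")
    return f"model_{resolution}/{SUITE_TO_SUBFOLDER[suite]}"

# To download just that folder (sketch; requires `pip install huggingface_hub`):
#   from huggingface_hub import snapshot_download
#   local_dir = snapshot_download(
#       REPO_ID,
#       allow_patterns=[checkpoint_subfolder("LIBERO-Goal", 512) + "/*"],
#   )

print(checkpoint_subfolder("LIBERO-Long", 512))  # -> model_512/libero_10
```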
## Citation <a name="citation"></a>
If you find the project helpful for your research, please consider citing our paper:
```bibtex
@article{cen2025worldvla,
title={WorldVLA: Towards Autoregressive Action World Model},
author={Cen, Jun and Yu, Chaohui and Yuan, Hangjie and Jiang, Yuming and Huang, Siteng and Guo, Jiayan and Li, Xin and Song, Yibing and Luo, Hao and Wang, Fan and others},
journal={arXiv preprint arXiv:2506.21539},
year={2025}
}
```
## Acknowledgment <a name="acknowledgment"></a>
This project builds upon [Lumina-mGPT](https://github.com/Alpha-VLLM/Lumina-mGPT), [Chameleon](https://github.com/facebookresearch/chameleon), and [OpenVLA](https://github.com/openvla/openvla). We thank these teams for their open-source contributions.