---
license: apache-2.0
base_model:
- facebook/chameleon-7b
tags:
- VLA
- Robotics
---
<p align="center">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/logo.png?raw=true" width="80" style="margin-bottom: 0.1;"/>
</p>

<h3 align="center"><a href="https://github.com/alibaba-damo-academy/WorldVLA/tree/main" style="color:#9C276A">
WorldVLA: Towards Autoregressive Action World Model</a></h3>
<h5 align="center"> If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏 </h5>

<h5 align="center">

[arXiv](https://arxiv.org/pdf/2506.21539)
[GitHub](https://github.com/alibaba-damo-academy/WorldVLA)
[Hugging Face](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA)
[License](https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/LICENSE)
</h5>

## Introduction
WorldVLA is an autoregressive action world model that unifies action and image understanding and generation. It integrates a Vision-Language-Action (VLA) model (the action model) and a world model in a single framework.
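
Both modes share a single autoregressive decoder over interleaved text, image, and action tokens. As a rough illustration of the two prompt layouts (the token names below are hypothetical placeholders, not WorldVLA's actual vocabulary):

```python
# Illustrative only: one decoder serves as action model or world model
# depending on how the interleaved token stream is arranged.

def action_model_prompt(text_tokens, image_tokens):
    """VLA mode: condition on text + image; the model continues with action tokens."""
    return ["<bos>"] + text_tokens + image_tokens + ["<act>"]

def world_model_prompt(image_tokens, action_tokens):
    """World-model mode: condition on image + action; the model continues with image tokens."""
    return ["<bos>"] + image_tokens + action_tokens + ["<img>"]

print(action_model_prompt(["open", "the", "drawer"], ["<i0>", "<i1>"]))
print(world_model_prompt(["<i0>", "<i1>"], ["<a0>", "<a1>"]))
```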
<div style="text-align: center;">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/overview.png" style="max-width: 100%; height: auto; display: block; margin: 0 auto;">
</div>
<br>

### Action Model Results (Text + Image -> Action)
The action model generates actions given a text instruction and image observations.
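
For an autoregressive model to emit actions as tokens, each continuous action dimension must be discretized. Below is a minimal sketch of the uniform-binning scheme popularized by OpenVLA (which this project builds on); the exact bin count and range WorldVLA uses are assumptions here:

```python
import numpy as np

N_BINS = 256           # assumed number of discrete bins per action dimension
LOW, HIGH = -1.0, 1.0  # assumed normalized action range

def action_to_tokens(action: np.ndarray) -> np.ndarray:
    """Map each continuous action dimension to its bin index."""
    clipped = np.clip(action, LOW, HIGH)
    return np.round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1)).astype(np.int64)

def tokens_to_action(tokens: np.ndarray) -> np.ndarray:
    """Invert the binning, recovering the action up to quantization error."""
    return tokens / (N_BINS - 1) * (HIGH - LOW) + LOW

action = np.array([0.1, -0.25, 0.0, 0.0, 0.0, 0.0, 1.0])  # e.g. 7-DoF delta pose + gripper
tokens = action_to_tokens(action)
assert np.allclose(tokens_to_action(tokens), action, atol=(HIGH - LOW) / (N_BINS - 1))
```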
<table>
<tr>
<td width="300">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/action_model_open_the_middle_drawer_of_the_cabinet.gif" width="100%">
</td>
<td width="300">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/action_model_pick_up_the_alphabet_soup_and_place_it_in_the_bask.gif" width="100%">
</td>
<td width="300">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/action_model_pick_up_the_black_bowl_between_the_plate_and_the_r.gif" width="100%">
</td>
</tr>
<tr>
<td><center>Input: Open the middle drawer of the cabinet.</center></td>
<td><center>Input: Pick up the alphabet soup and place it in the basket.</center></td>
<td><center>Input: Pick up the black bowl between the plate and the ramekin and place it on the plate.</center></td>
</tr>
</table>

### World Model Results (Action + Image -> Image)
The world model generates the next frame given the current frame and an action as the control signal.
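
Rolling out a full clip is autoregressive over time: each predicted frame is fed back as the conditioning frame for the next action. A minimal sketch, with a hypothetical `WorldModelStub` standing in for the real checkpoint:

```python
import numpy as np

class WorldModelStub:
    """Hypothetical stand-in for the real model, only to make the interface concrete."""
    def predict_next_frame(self, frame: np.ndarray, action: np.ndarray) -> np.ndarray:
        # Placeholder: the real model tokenizes (frame, action), autoregressively
        # decodes next-frame image tokens, and detokenizes them back to pixels.
        return frame

def rollout(model, first_frame: np.ndarray, actions: np.ndarray) -> list:
    """Unroll a trajectory by feeding each predicted frame back into the model."""
    frames = [first_frame]
    for action in actions:
        frames.append(model.predict_next_frame(frames[-1], action))
    return frames

frames = rollout(WorldModelStub(), np.zeros((256, 256, 3)), np.zeros((5, 7)))
print(len(frames))  # 6: the initial frame plus 5 predicted frames
```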
<table>
<tr>
<td width="300" align="center">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/world_model_open_the_top_drawer_and_put_the_bowl_inside.gif" width="100%">
</td>
<td width="300" align="center">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/world_model_push_the_plate_to_the_front_of_the_stove.gif" width="100%">
</td>
<td width="300" align="center">
<img src="https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/worldvla/assets/world_model_put_the_bowl_on_the_stove.gif" width="100%">
</td>
</tr>
<tr>
<td align="center">
Input: Action sequence of "Open the top drawer and put the bowl inside".
</td>
<td align="center">
Input: Action sequence of "Push the plate to the front of the stove".
</td>
<td align="center">
Input: Action sequence of "Put the bowl on the stove".
</td>
</tr>
</table>

## Model Zoo

| Model (256 × 256) | HF Link | Success Rate (%) |
| :--------------------: | :------------------------------------------------------------: | :--------------------: |
| LIBERO-Spatial | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_spatial](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_spatial) | 85.6 |
| LIBERO-Object | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_object](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_object) | 89.0 |
| LIBERO-Goal | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_goal](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_goal) | 82.6 |
| LIBERO-Long | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_10](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_10) | 59.0 |

<br>

| Model (512 × 512) | HF Link | Success Rate (%) |
| :--------------------: | :------------------------------------------------------------: | :--------------------: |
| LIBERO-Spatial | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_spatial](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_spatial) | 87.6 |
| LIBERO-Object | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_object](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_object) | 96.2 |
| LIBERO-Goal | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_goal](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_goal) | 83.4 |
| LIBERO-Long | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_10](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_10) | 60.0 |
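
Any checkpoint above can be fetched locally with `huggingface_hub`; the pattern below just mirrors the subfolder layout shown in the tables:

```python
from huggingface_hub import snapshot_download

# Download only the 512x512 LIBERO-Object checkpoint from the model repo.
local_dir = snapshot_download(
    repo_id="Alibaba-DAMO-Academy/WorldVLA",
    allow_patterns=["model_512/libero_object/*"],  # adjust to the checkpoint you need
)
print(local_dir)  # path to the local snapshot
```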
## Citation <a name="citation"></a>
If you find the project helpful for your research, please consider citing our paper:
```bibtex
@article{cen2025worldvla,
  title={WorldVLA: Towards Autoregressive Action World Model},
  author={Cen, Jun and Yu, Chaohui and Yuan, Hangjie and Jiang, Yuming and Huang, Siteng and Guo, Jiayan and Li, Xin and Song, Yibing and Luo, Hao and Wang, Fan and others},
  journal={arXiv preprint arXiv:2506.21539},
  year={2025}
}
```

## Acknowledgment <a name="acknowledgment"></a>
This project builds upon [Lumina-mGPT](https://github.com/Alpha-VLLM/Lumina-mGPT), [Chameleon](https://github.com/facebookresearch/chameleon), and [OpenVLA](http://github.com/openvla/openvla). We thank these teams for their open-source contributions.