---
license: apache-2.0
pipeline_tag: robotics
library_name: transformers
tags:
- vla
- world-model
- embodied-ai
---

# Chain of World: World Model Thinking in Latent Motion

This repository contains the weights for **CoWVLA** (Chain-of-World VLA), a Vision-Language-Action framework that unifies world-model temporal reasoning with disentangled latent motion modeling.

[**Project Page**](https://fx-hit.github.io/cowvla-io/) | [**Paper**](https://huggingface.co/papers/2603.03195) | [**GitHub**](https://github.com/fx-hit/CoWVLA)

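Since this repository hosts the weights, a minimal sketch of fetching them locally may be useful. It assumes the `huggingface_hub` package is installed; the helper name `download_cowvla_weights` and the default `local_dir` are illustrative and not part of the CoWVLA codebase, and the repository ID is left as a required argument because it is not stated here.

```python
def download_cowvla_weights(repo_id: str, local_dir: str = "./cowvla_weights") -> str:
    """Download every file of a Hub repository and return the local folder path.

    `repo_id` is the "<owner>/<name>" identifier shown on the repository page.
    This helper is a hypothetical convenience wrapper, not an official API.
    """
    # Imported lazily so that defining the helper does not require the dependency.
    from huggingface_hub import snapshot_download

    # snapshot_download fetches all files of the repository into local_dir.
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Call it with the ID of this repository as displayed on its Hub page. For loading the checkpoint into a model and running inference, refer to the GitHub repository linked above.
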
## Overview

CoWVLA introduces a "Chain of World" paradigm to address limitations in current VLA models. While world-model VLAs often waste capacity reconstructing redundant backgrounds and latent-action VLAs lack temporally continuous modeling, CoWVLA:
- Uses a pretrained video VAE (**VidTwin**) to disentangle structure and motion latents.
- Pre-trains a VLA decoder to infer a continuous latent motion chain from an instruction and initial frame.
- Co-fine-tunes the model to align latent dynamics with discrete action prediction in a single autoregressive decoder.

This design preserves the temporal reasoning benefits of world models while maintaining the compactness and interpretability of latent actions.

## Evaluation Results

CoWVLA reports the following results on two robotic simulation benchmarks, LIBERO and SimplerEnv-WidowX:

| Benchmark | Metric | CoWVLA |
| --- | --- | --- |
| **LIBERO** | Spatial / Object / Goal / Long / Avg. | 97.2 / 97.8 / 94.6 / 92.8 / 95.6 |
| **SimplerEnv-WidowX** | Stack / Carrot / Spoon / Eggplant / Avg. | 62.5 / 66.7 / 79.2 / 95.8 / 76.0 |

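As a quick sanity check on the table above, the sketch below recomputes each "Avg." entry, assuming it is the unweighted mean of the four per-task scores (the table does not state how the average is formed). The names `results`, `check_averages`, and the 0.1 tolerance are illustrative choices.

```python
# Per-task scores and reported averages copied from the table above.
results = {
    "LIBERO": {"tasks": [97.2, 97.8, 94.6, 92.8], "reported_avg": 95.6},
    "SimplerEnv-WidowX": {"tasks": [62.5, 66.7, 79.2, 95.8], "reported_avg": 76.0},
}


def check_averages(table, tolerance=0.1):
    """Return {benchmark: (recomputed_mean, within_tolerance)}.

    Assumes "Avg." is the unweighted mean of the per-task scores; the
    tolerance absorbs rounding of the one-decimal values in the table.
    """
    checked = {}
    for name, row in table.items():
        recomputed = sum(row["tasks"]) / len(row["tasks"])
        checked[name] = (recomputed, abs(recomputed - row["reported_avg"]) <= tolerance)
    return checked


for name, (recomputed, ok) in check_averages(results).items():
    print(f"{name}: recomputed {recomputed:.2f}, "
          f"reported {results[name]['reported_avg']}, match={ok}")
```

Under that assumption the LIBERO mean reproduces the reported 95.6, while the SimplerEnv-WidowX mean comes out near 76.05 against a reported 76.0. That gap is consistent with the per-task values having been rounded to one decimal before being listed.
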
## Citation

If you find this work useful for your research, please cite:

```bibtex
@inproceedings{yang2026cowvla,
  title = {Chain of World: World Model Thinking in Latent Motion},
  author = {Yang, Fuxiang and Di, Donglin and Tang, Lulu and Zhang, Xuancheng and Fan, Lei and Li, Hao and Chen, Wei and Su, Tonghua and Ma, Baorui},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2026}
}
```