---
license: apache-2.0
pipeline_tag: robotics
library_name: transformers
tags:
- vla
- world-model
- embodied-ai
---
Chain of World: World Model Thinking in Latent Motion
This repository contains the weights for CoWVLA (Chain-of-World VLA), a Vision-Language-Action framework that unifies world-model temporal reasoning with disentangled latent motion modeling.
🌐 Project Page | 📄 Paper | 💻 GitHub
Overview
CoWVLA introduces a "Chain of World" paradigm that targets two limitations of current VLA models: world-model VLAs often waste capacity reconstructing redundant backgrounds, and latent-action VLAs lack temporally continuous modeling. CoWVLA instead:
- Uses a pretrained video VAE (VidTwin) to disentangle structure and motion latents.
- Pre-trains a VLA decoder to infer a continuous latent motion chain from an instruction and initial frame.
- Co-fine-tunes the model to align latent dynamics with discrete action prediction in a single autoregressive decoder.
This design preserves the temporal reasoning benefits of world models while maintaining the compactness and interpretability of latent actions.
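The three steps listed above can be pictured as the layout of a single decoder sequence. The sketch below is illustrative only: the segment names, their lengths, their ordering, and the `build_layout` and `supervised_positions` helpers are assumptions made for this example, not the CoWVLA implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    name: str         # which part of the decoder sequence this is
    length: int       # number of positions it occupies (illustrative values)
    supervised: bool  # whether a training loss is applied to these positions


def build_layout(stage: str, n_text: int = 16, n_frame: int = 64,
                 n_motion: int = 8, n_action: int = 7) -> List[Segment]:
    """Return a hypothetical sequence layout for one training stage."""
    if stage not in {"pretrain", "cofinetune"}:
        raise ValueError(f"unknown stage: {stage}")
    # Conditioning: the instruction and the initial frame are given, not predicted.
    layout = [
        Segment("instruction", n_text, supervised=False),
        Segment("initial_frame", n_frame, supervised=False),
        # Pre-training target: the latent motion chain.
        Segment("latent_motion_chain", n_motion, supervised=True),
    ]
    if stage == "cofinetune":
        # Co-fine-tuning adds discrete action prediction in the same sequence.
        layout.append(Segment("discrete_actions", n_action, supervised=True))
    return layout


def supervised_positions(layout: List[Segment]) -> int:
    """Count the positions that would receive a loss under this layout."""
    return sum(seg.length for seg in layout if seg.supervised)


if __name__ == "__main__":
    for stage in ("pretrain", "cofinetune"):
        layout = build_layout(stage)
        print(stage, [seg.name for seg in layout], supervised_positions(layout))
```

Under these assumptions, co-fine-tuning differs from pre-training only by the extra action segment, which is one way to read "a single autoregressive decoder" handling both latent dynamics and actions.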
Evaluation Results
CoWVLA is evaluated on two robotic simulation benchmarks, LIBERO and SimplerEnv-WidowX. The table lists per-task scores followed by their average:
| Benchmark | Tasks | CoWVLA (success rate, %) |
|---|---|---|
| LIBERO | Spatial / Object / Goal / Long / Avg. | 97.2 / 97.8 / 94.6 / 92.8 / 95.6 |
| SimplerEnv-WidowX | Stack / Carrot / Spoon / Eggplant / Avg. | 62.5 / 66.7 / 79.2 / 95.8 / 76.0 |
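As a quick sanity check on the table, the short script below recomputes each "Avg." entry from the four per-task scores. It assumes the average is an unweighted mean of the listed tasks, which the card does not state explicitly; the `results`, `reported_avg`, and `mean` names are introduced here for illustration.

```python
# Per-task scores copied from the table above.
results = {
    "LIBERO": [97.2, 97.8, 94.6, 92.8],
    "SimplerEnv-WidowX": [62.5, 66.7, 79.2, 95.8],
}
# "Avg." values as reported in the table.
reported_avg = {"LIBERO": 95.6, "SimplerEnv-WidowX": 76.0}


def mean(values):
    """Unweighted arithmetic mean (assumed to be how Avg. is computed)."""
    return sum(values) / len(values)


for name, scores in results.items():
    recomputed = mean(scores)
    # Per-task scores are rounded to one decimal, so allow a small tolerance.
    consistent = abs(recomputed - reported_avg[name]) < 0.1
    print(f"{name}: reported {reported_avg[name]}, consistent={consistent}")
```

Both reported averages agree with the unweighted mean to within rounding of the per-task scores.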
Citation
If you find this work useful for your research, please cite:
@inproceedings{yang2026cowvla,
title = {Chain of World: World Model Thinking in Latent Motion},
author = {Yang, Fuxiang and Di, Donglin and Tang, Lulu and Zhang, Xuancheng and Fan, Lei and Li, Hao and Chen, Wei and Su, Tonghua and Ma, Baorui},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}