license: apache-2.0
pipeline_tag: robotics
library_name: transformers
tags:
  - vla
  - world-model
  - embodied-ai

Chain of World: World Model Thinking in Latent Motion

This repository contains the weights for CoWVLA (Chain-of-World VLA), a Vision-Language-Action framework that unifies world-model temporal reasoning with disentangled latent motion modeling.

🌐 Project Page | 📄 Paper | 💻 GitHub

Overview

CoWVLA introduces a "Chain of World" paradigm to address two limitations of current VLA models: world-model VLAs often waste capacity reconstructing redundant backgrounds, and latent-action VLAs lack temporally continuous modeling. To address both, CoWVLA:

- Uses a pretrained video VAE (VidTwin) to disentangle structure and motion latents.
- Pre-trains a VLA decoder to infer a continuous latent motion chain from an instruction and initial frame.
- Co-fine-tunes the model to align latent dynamics with discrete action prediction in a single autoregressive decoder.

This design preserves the temporal reasoning benefits of world models while maintaining the compactness and interpretability of latent actions.
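The card does not say how actions are made discrete for the co-fine-tuning stage. As a purely illustrative sketch, the snippet below assumes uniform binning of normalized continuous actions, a scheme used by several VLA models but not confirmed for CoWVLA. The names `NUM_BINS`, `LOW`, `HIGH`, `discretize`, and `undiscretize` are hypothetical, as are the bin count and the value range.

```python
# Hypothetical illustration of discrete action prediction targets.
# NUM_BINS and the [-1, 1] range are assumptions, not CoWVLA settings.
NUM_BINS = 256
LOW, HIGH = -1.0, 1.0


def discretize(action):
    """Map each continuous action dimension to an integer bin index."""
    tokens = []
    for a in action:
        clipped = min(max(a, LOW), HIGH)
        # Scale to [0, 1], then to a bin index in [0, NUM_BINS - 1].
        idx = int((clipped - LOW) / (HIGH - LOW) * NUM_BINS)
        tokens.append(min(idx, NUM_BINS - 1))
    return tokens


def undiscretize(tokens):
    """Map bin indices back to the centers of their bins."""
    width = (HIGH - LOW) / NUM_BINS
    return [LOW + (t + 0.5) * width for t in tokens]


if __name__ == "__main__":
    action = [0.3, -0.7, 1.0]
    tokens = discretize(action)
    print(tokens, undiscretize(tokens))
```

Under this assumed scheme, the round trip loses at most half a bin width per dimension, which is the usual trade-off when an autoregressive decoder emits actions as tokens.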

Evaluation Results

CoWVLA is evaluated on two robotic simulation benchmarks, LIBERO and SimplerEnv-WidowX. Each row lists the per-task scores followed by their average:

| Benchmark | Metric | CoWVLA |
|---|---|---|
| LIBERO | Spatial / Object / Goal / Long / Avg. | 97.2 / 97.8 / 94.6 / 92.8 / 95.6 |
| SimplerEnv-WidowX | Stack / Carrot / Spoon / Eggplant / Avg. | 62.5 / 66.7 / 79.2 / 95.8 / 76.0 |
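As a quick sanity check on the table above, the following sketch recomputes each "Avg." entry from the four per-task scores. It assumes the reported average is the unweighted arithmetic mean, which the card does not state explicitly. The names `results`, `reported_avg`, and `mean` are illustrative.

```python
# Per-task scores copied from the results table.
results = {
    "LIBERO": {"Spatial": 97.2, "Object": 97.8, "Goal": 94.6, "Long": 92.8},
    "SimplerEnv-WidowX": {"Stack": 62.5, "Carrot": 66.7, "Spoon": 79.2, "Eggplant": 95.8},
}

# "Avg." values as reported in the table.
reported_avg = {"LIBERO": 95.6, "SimplerEnv-WidowX": 76.0}


def mean(scores):
    """Unweighted arithmetic mean of a {task: score} mapping (an assumption)."""
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    for name, scores in results.items():
        print(f"{name}: computed {mean(scores):.2f}, reported {reported_avg[name]}")
```

Both recomputed means agree with the reported averages to within rounding of the one-decimal table entries.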

Citation

If you find this work useful for your research, please cite:

@inproceedings{yang2026cowvla,
  title     = {Chain of World: World Model Thinking in Latent Motion},
  author    = {Yang, Fuxiang and Di, Donglin and Tang, Lulu and Zhang, Xuancheng and Fan, Lei and Li, Hao and Chen, Wei and Su, Tonghua and Ma, Baorui},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}