nielsr (HF Staff) committed
Commit 5554698 · verified · 1 Parent(s): 4c3b9db

Improve model card and add robotics metadata


Hi, I'm Niels from the community science team at Hugging Face. I've opened this PR to improve the model card for CoWVLA.

The changes include:
- Adding the `robotics` pipeline tag to the metadata.
- Adding `library_name: transformers`, since the configuration files indicate compatibility with the library.
- Including links to the paper, project page, and the official GitHub repository.
- Providing a brief overview of the CoWVLA framework and its performance on benchmarks.
- Adding the BibTeX citation for researchers.

These additions help improve the discoverability and usability of the model on the Hub.
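For context on how these fields take effect: the Hub reads them from the YAML front matter at the top of `README.md`. Below is a minimal, hand-rolled sketch of that extraction, for illustration only (this is not the Hub's actual parser, and `parse_front_matter` is a hypothetical helper; a real parser would use a full YAML library):

```python
# Sketch: read simple key-value pairs from a model card's YAML front matter.
# Illustrative only; the Hub's real parser handles full YAML (lists, nesting).
def parse_front_matter(card_text: str) -> dict:
    lines = card_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter block
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter ends the block
            break
        key, sep, value = line.partition(":")
        if sep and value.strip():  # skip list headers like "tags:"
            meta[key.strip()] = value.strip()
    return meta

card = """---
license: apache-2.0
pipeline_tag: robotics
library_name: transformers
---

# Chain of World: World Model Thinking in Latent Motion
"""

print(parse_front_matter(card)["pipeline_tag"])  # -> robotics
```

With the metadata added by this PR, the card is indexed under the `robotics` pipeline tag and shows a `transformers` loading widget.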

Files changed (1):
  1. README.md +41 -2
README.md CHANGED
@@ -1,7 +1,46 @@
  ---
  license: apache-2.0
+ pipeline_tag: robotics
+ library_name: transformers
+ tags:
+ - vla
+ - world-model
+ - embodied-ai
  ---

- Project page: https://fx-hit.github.io/cowvla-io/
-
- Paper: https://huggingface.co/papers/2603.03195
+ # Chain of World: World Model Thinking in Latent Motion
+
+ This repository contains the weights for **CoWVLA** (Chain-of-World VLA), a Vision-Language-Action framework that unifies world-model temporal reasoning with disentangled latent motion modeling.
+
+ [**🌐 Project Page**](https://fx-hit.github.io/cowvla-io/) | [**📄 Paper**](https://huggingface.co/papers/2603.03195) | [**💻 GitHub**](https://github.com/fx-hit/CoWVLA)
+
+ ## Overview
+
+ CoWVLA introduces a "Chain of World" paradigm to address limitations in current VLA models. While world-model VLAs often waste capacity reconstructing redundant backgrounds and latent-action VLAs lack temporally continuous modeling, CoWVLA:
+ - Uses a pretrained video VAE (**VidTwin**) to disentangle structure and motion latents.
+ - Pre-trains a VLA decoder to infer a continuous latent motion chain from an instruction and initial frame.
+ - Co-fine-tunes the model to align latent dynamics with discrete action prediction in a single autoregressive decoder.
+
+ This design preserves the temporal reasoning benefits of world models while maintaining the compactness and interpretability of latent actions.
+
+ ## Evaluation Results
+
+ CoWVLA demonstrates strong performance across major robotic simulation benchmarks:
+
+ | Benchmark | Metric | CoWVLA |
+ | --- | --- | --- |
+ | **LIBERO** | Spatial / Object / Goal / Long / Avg. | 97.2 / 97.8 / 94.6 / 92.8 / 95.6 |
+ | **SimplerEnv-WidowX** | Stack / Carrot / Spoon / Eggplant / Avg. | 62.5 / 66.7 / 79.2 / 95.8 / 76.0 |
+
+ ## Citation
+
+ If you find this work useful for your research, please cite:
+
+ ```bibtex
+ @inproceedings{yang2026cowvla,
+   title     = {Chain of World: World Model Thinking in Latent Motion},
+   author    = {Yang, Fuxiang and Di, Donglin and Tang, Lulu and Zhang, Xuancheng and Fan, Lei and Li, Hao and Chen, Wei and Su, Tonghua and Ma, Baorui},
+   booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
+   year      = {2026}
+ }
+ ```