Improve model card and add robotics metadata
Hi, I'm Niels from the community science team at Hugging Face. I've opened this PR to improve the model card for CoWVLA.
The changes include:
- Adding the `robotics` pipeline tag to the metadata.
- Adding `library_name: transformers` as the configuration files indicate compatibility.
- Including links to the paper, project page, and the official GitHub repository.
- Providing a brief overview of the CoWVLA framework and its performance on benchmarks.
- Adding the BibTeX citation for researchers.
These additions help improve the discoverability and usability of the model on the Hub.
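The metadata changes above can be sanity-checked programmatically. Below is a minimal sketch with a hypothetical `parse_front_matter` helper that does simple key/value parsing only (not a full YAML parser) to confirm the new fields are present in the card's front matter:

```python
def parse_front_matter(readme: str) -> dict:
    """Extract simple `key: value` pairs from a Markdown YAML front-matter block."""
    lines = readme.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of front-matter block
        if ":" in line and not line.lstrip().startswith("- "):
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

card = """---
license: apache-2.0
pipeline_tag: robotics
library_name: transformers
tags:
- vla
- world-model
- embodied-ai
---

# Chain of World
"""

meta = parse_front_matter(card)
print(meta["pipeline_tag"], meta["library_name"])  # robotics transformers
```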
README.md (CHANGED):

````diff
@@ -1,7 +1,46 @@
 ---
 license: apache-2.0
+pipeline_tag: robotics
+library_name: transformers
+tags:
+- vla
+- world-model
+- embodied-ai
 ---
 
-
+# Chain of World: World Model Thinking in Latent Motion
 
-
+This repository contains the weights for **CoWVLA** (Chain-of-World VLA), a Vision-Language-Action framework that unifies world-model temporal reasoning with disentangled latent motion modeling.
+
+[**🌐 Project Page**](https://fx-hit.github.io/cowvla-io/) | [**📄 Paper**](https://huggingface.co/papers/2603.03195) | [**💻 GitHub**](https://github.com/fx-hit/CoWVLA)
+
+## Overview
+
+CoWVLA introduces a "Chain of World" paradigm to address limitations in current VLA models. While world-model VLAs often waste capacity reconstructing redundant backgrounds and latent-action VLAs lack temporally continuous modeling, CoWVLA:
+- Uses a pretrained video VAE (**VidTwin**) to disentangle structure and motion latents.
+- Pre-trains a VLA decoder to infer a continuous latent motion chain from an instruction and initial frame.
+- Co-fine-tunes the model to align latent dynamics with discrete action prediction in a single autoregressive decoder.
+
+This design preserves the temporal reasoning benefits of world models while maintaining the compactness and interpretability of latent actions.
+
+## Evaluation Results
+
+CoWVLA demonstrates strong performance across major robotic simulation benchmarks:
+
+| Benchmark | Metric | CoWVLA |
+| --- | --- | --- |
+| **LIBERO** | Spatial / Object / Goal / Long / Avg. | 97.2 / 97.8 / 94.6 / 92.8 / 95.6 |
+| **SimplerEnv-WidowX** | Stack / Carrot / Spoon / Eggplant / Avg. | 62.5 / 66.7 / 79.2 / 95.8 / 76.0 |
+
+## Citation
+
+If you find this work useful for your research, please cite:
+
+```bibtex
+@inproceedings{yang2026cowvla,
+  title     = {Chain of World: World Model Thinking in Latent Motion},
+  author    = {Yang, Fuxiang and Di, Donglin and Tang, Lulu and Zhang, Xuancheng and Fan, Lei and Li, Hao and Chen, Wei and Su, Tonghua and Ma, Baorui},
+  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
+  year      = {2026}
+}
+```
````
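For readers who want to sanity-check the evaluation table, the reported per-benchmark averages follow from the individual task scores. A throwaway snippet (scores copied from the card; a tolerance of 0.1 is assumed to absorb rounding in the reported averages):

```python
# Per-task success rates from the model card's evaluation table.
libero = [97.2, 97.8, 94.6, 92.8]          # Spatial, Object, Goal, Long
simpler_widowx = [62.5, 66.7, 79.2, 95.8]  # Stack, Carrot, Spoon, Eggplant

def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs)

# Reported averages: 95.6 (LIBERO) and 76.0 (SimplerEnv-WidowX).
assert abs(mean(libero) - 95.6) < 0.1
assert abs(mean(simpler_widowx) - 76.0) < 0.1
```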
|