Add model card and metadata
#1
by nielsr HF Staff - opened

README.md CHANGED
@@ -1,3 +1,35 @@
----
-license: apache-2.0
----
+---
+license: apache-2.0
+pipeline_tag: image-to-video
+---
+
+# VideoWorld 2: Learning Transferable Knowledge from Real-world Videos
+
+[Paper](https://huggingface.co/papers/2602.10102) | [Project Page](https://maverickren.github.io/VideoWorld2.github.io/) | [Code](https://github.com/ByteDance-Seed/VideoWorld/tree/main/VideoWorld2)
+
+VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance. This framework learns transferable world knowledge directly from raw real-world videos, which can then support long-horizon reasoning and task execution in new environments.
+
+## Highlights
+- **Decoupled Action Dynamics:** Separates task-relevant dynamics from visual appearance, letting the dLDM focus on meaningful latent codes.
+- **Coherent Long-Horizon Reasoning:** Models latent codes autoregressively to learn task policies and produce coherent long-horizon execution videos.
+- **State-of-the-Art Performance:** Achieves up to a 70% improvement in task success rates on challenging real-world handcrafting tasks.
+- **Robotics Knowledge Transfer:** Demonstrates effective knowledge acquisition from the Open-X dataset, improving performance on manipulation benchmarks such as CALVIN.
+
+## Architecture
+Overview of the VideoWorld 2 model architecture:
+1. **Compression:** A dLDM compresses future visual changes into compact, generalizable latent codes.
+2. **Modeling:** These codes are modeled by an autoregressive transformer.
+3. **Inference:** The transformer predicts latent codes for an unseen environment from an initial input image, which are then decoded into task execution videos.
+
+## Citation
+```bibtex
+@misc{ren2026videoworld2,
+  title={VideoWorld 2: Learning Transferable Knowledge from Real-world Videos},
+  author={Zhongwei Ren and Yunchao Wei and Xiao Yu and Guixun Luo and Yao Zhao and Bingyi Kang and Jiashi Feng and Xiaojie Jin},
+  year={2026},
+  eprint={2602.10102},
+  archivePrefix={arXiv},
+  primaryClass={cs.CV},
+  url={https://arxiv.org/abs/2602.10102},
+}
+```