Add model card and metadata

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +35 -3
README.md CHANGED
@@ -1,3 +1,35 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ pipeline_tag: image-to-video
+ ---
+
+ # VideoWorld 2: Learning Transferable Knowledge from Real-world Videos
+
+ [Paper](https://huggingface.co/papers/2602.10102) | [Project Page](https://maverickren.github.io/VideoWorld2.github.io/) | [Code](https://github.com/ByteDance-Seed/VideoWorld/tree/main/VideoWorld2)
+
+ VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance. This framework enables learning transferable world knowledge directly from raw real-world videos, which can then be applied to support long-horizon reasoning and task execution in new environments.
+
+ ## Highlights
+ - **Decoupled Action Dynamics:** Decouples task-relevant dynamics from visual appearance, enabling the dLDM to focus on meaningful latent codes.
+ - **Coherent Long-Horizon Reasoning:** Models latent codes autoregressively to learn task policies and produce coherent long-horizon execution videos.
+ - **State-of-the-Art Performance:** Achieves up to 70% improvement in task success rates on challenging real-world handcrafting tasks.
+ - **Robotics Knowledge Transfer:** Demonstrates effective knowledge acquisition from the Open-X dataset, improving performance on manipulation benchmarks like CALVIN.
+
+ ## Architecture
+ Overview of the VideoWorld 2 model architecture:
+ 1. **Compression:** A dLDM compresses future visual changes into compact, generalizable latent codes.
+ 2. **Modeling:** These codes are modeled by an autoregressive transformer.
+ 3. **Inference:** The transformer predicts latent codes for an unseen environment from an initial input image, which are then decoded into task execution videos.
+
+ ## Citation
+ ```bibtex
+ @misc{ren2026videoworld2,
+   title={VideoWorld 2: Learning Transferable Knowledge from Real-world Videos},
+   author={Zhongwei Ren and Yunchao Wei and Xiao Yu and Guixun Luo and Yao Zhao and Bingyi Kang and Jiashi Feng and Xiaojie Jin},
+   year={2026},
+   eprint={2602.10102},
+   archivePrefix={arXiv},
+   primaryClass={cs.CV},
+   url={https://arxiv.org/abs/2602.10102},
+ }
+ ```