Add model card and metadata
#1
by nielsr HF Staff - opened

README.md CHANGED
@@ -1,3 +1,35 @@
----
-license: apache-2.0
----
+---
+license: apache-2.0
+pipeline_tag: image-to-video
+---
+
+# VideoWorld 2: Learning Transferable Knowledge from Real-world Videos
+
+[Paper](https://huggingface.co/papers/2602.10102) | [Project Page](https://maverickren.github.io/VideoWorld2.github.io/) | [Code](https://github.com/ByteDance-Seed/VideoWorld/tree/main/VideoWorld2)
+
+VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance. This framework learns transferable world knowledge directly from raw real-world videos, which can then support long-horizon reasoning and task execution in new environments.
+
+## Highlights
+- **Decoupled Action Dynamics:** Separates task-relevant dynamics from visual appearance, letting the dLDM focus on meaningful latent codes.
+- **Coherent Long-Horizon Reasoning:** Models latent codes autoregressively to learn task policies and produce coherent long-horizon execution videos.
+- **State-of-the-Art Performance:** Achieves up to a 70% improvement in task success rates on challenging real-world handcrafting tasks.
+- **Robotics Knowledge Transfer:** Demonstrates effective knowledge acquisition from the Open-X dataset, improving performance on manipulation benchmarks such as CALVIN.
+
+## Architecture
+Overview of the VideoWorld 2 model architecture:
+1. **Compression:** A dLDM compresses future visual changes into compact, generalizable latent codes.
+2. **Modeling:** These codes are modeled by an autoregressive transformer.
+3. **Inference:** The transformer predicts latent codes for an unseen environment from an initial input image, which are then decoded into task execution videos.
+
+## Citation
+```bibtex
+@misc{ren2026videoworld2,
+  title={VideoWorld 2: Learning Transferable Knowledge from Real-world Videos},
+  author={Zhongwei Ren and Yunchao Wei and Xiao Yu and Guixun Luo and Yao Zhao and Bingyi Kang and Jiashi Feng and Xiaojie Jin},
+  year={2026},
+  eprint={2602.10102},
+  archivePrefix={arXiv},
+  primaryClass={cs.CV},
+  url={https://arxiv.org/abs/2602.10102},
+}
+```