---
license: apache-2.0
pipeline_tag: image-to-video
---
# VideoWorld 2: Learning Transferable Knowledge from Real-world Videos
Paper | Project Page | Code
VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance. This framework enables learning transferable world knowledge directly from raw real-world videos, which can then be applied to support long-horizon reasoning and task execution in new environments.
## Highlights
- Decoupled Action Dynamics: Separates task-relevant dynamics from visual appearance, letting the dLDM focus on compact, meaningful latent codes.
- Coherent Long Horizon Reasoning: Models latent codes autoregressively to learn task policies and produce coherent long-horizon execution videos.
- State-of-the-Art Performance: Achieves up to 70% improvement in task success rates on challenging real-world handcrafting tasks.
- Robotics Knowledge Transfer: Demonstrates effective knowledge acquisition from the Open-X dataset, improving performance on manipulation benchmarks like CALVIN.
## Architecture
Overview of the VideoWorld 2 model architecture:
- Compression: A dLDM compresses future visual changes into compact, generalizable latent codes.
- Modeling: These codes are modeled by an autoregressive transformer.
- Inference: The transformer predicts latent codes for an unseen environment from an initial input image, which are then decoded into task execution videos.
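The three stages above can be illustrated with a toy rollout loop. This is a hypothetical sketch, not the released VideoWorld 2 API: all class names, shapes, and projections are illustrative assumptions, chosen only to show how latent codes flow from encoder to autoregressive predictor to decoder.

```python
# Hypothetical sketch of the VideoWorld 2 inference pipeline (not the real API).
# Shapes, projections, and class names are illustrative assumptions.
import numpy as np

LATENT_DIM = 8             # assumed size of a dLDM latent code
FRAME_SHAPE = (16, 16, 3)  # toy frame resolution

class DynamicsEncoder:
    """Stage 1 (dLDM): compress the visual change between frames into a code."""
    def encode(self, prev_frame, next_frame):
        diff = (next_frame - prev_frame).reshape(-1)
        # Toy "compression": fixed random projection of the pixel difference.
        proj = np.random.default_rng(0).standard_normal((diff.size, LATENT_DIM))
        return diff @ proj

class AutoregressiveDynamics:
    """Stage 2: predict the next latent code from the history of codes."""
    def predict(self, code_history):
        # Toy stand-in for the transformer: weighted average over past codes.
        weights = np.array([0.5 ** i for i in range(len(code_history))][::-1])
        weights /= weights.sum()
        return np.average(np.stack(code_history), axis=0, weights=weights)

class VideoDecoder:
    """Stage 3: decode a latent code (plus the current frame) into the next frame."""
    def decode(self, frame, code):
        proj = np.random.default_rng(1).standard_normal(
            (LATENT_DIM, int(np.prod(FRAME_SHAPE))))
        return frame + (code @ proj).reshape(FRAME_SHAPE) * 1e-3

def rollout(initial_frame, seed_codes, steps=4):
    """Autoregressive rollout: predicted codes are decoded into a frame sequence."""
    frames, codes = [initial_frame], list(seed_codes)
    dyn, dec = AutoregressiveDynamics(), VideoDecoder()
    for _ in range(steps):
        code = dyn.predict(codes)
        codes.append(code)
        frames.append(dec.decode(frames[-1], code))
    return frames

# Seed the rollout with one code encoded from an observed frame pair.
f0, f1 = np.zeros(FRAME_SHAPE), np.full(FRAME_SHAPE, 0.1)
seed = DynamicsEncoder().encode(f0, f1)
frames = rollout(f1, [seed], steps=4)
print(len(frames))  # initial frame + 4 predicted frames
```

The key design point the sketch mirrors is that the autoregressive model never touches pixels: it operates purely on the compact latent codes, and only the decoder maps codes back to appearance.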
## Citation
```bibtex
@misc{ren2026videoworld2,
      title={VideoWorld 2: Learning Transferable Knowledge from Real-world Videos},
      author={Zhongwei Ren and Yunchao Wei and Xiao Yu and Guixun Luo and Yao Zhao and Bingyi Kang and Jiashi Feng and Xiaojie Jin},
      year={2026},
      eprint={2602.10102},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.10102},
}
```