---
license: apache-2.0
pipeline_tag: image-to-video
---
# VideoWorld 2: Learning Transferable Knowledge from Real-world Videos
Paper | Project Page | Code
VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance. This framework enables learning transferable world knowledge directly from raw real-world videos, which can then be applied to support long-horizon reasoning and task execution in new environments.
## Highlights
- Decoupled Action Dynamics: Separates task-relevant dynamics from visual appearance, letting the dLDM focus on compact, meaningful latent codes.
- Coherent Long Horizon Reasoning: Models latent codes autoregressively to learn task policies and produce coherent long-horizon execution videos.
- State-of-the-Art Performance: Achieves up to 70% improvement in task success rates on challenging real-world handcrafting tasks.
- Robotics Knowledge Transfer: Demonstrates effective knowledge acquisition from the Open-X dataset, improving performance on manipulation benchmarks like CALVIN.
## Architecture
Overview of the VideoWorld 2 model architecture:
- Compression: A dLDM compresses future visual changes into compact, generalizable latent codes.
- Modeling: These codes are modeled by an autoregressive transformer.
- Inference: The transformer predicts latent codes for an unseen environment from an initial input image, which are then decoded into task execution videos.
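The three stages above can be illustrated with a toy rollout loop. This is a hypothetical sketch, not the released VideoWorld 2 API: all class names, shapes, and projections are illustrative assumptions, chosen only to show how latent codes flow from encoder to autoregressive predictor to decoder.

```python
# Hypothetical sketch of the VideoWorld 2 inference pipeline (not the real API).
# Shapes, projections, and class names are illustrative assumptions.
import numpy as np

LATENT_DIM = 8             # assumed size of a dLDM latent code
FRAME_SHAPE = (16, 16, 3)  # toy frame resolution

class DynamicsEncoder:
    """Stage 1 (dLDM): compress the visual change between frames into a code."""
    def encode(self, prev_frame, next_frame):
        diff = (next_frame - prev_frame).reshape(-1)
        # Toy "compression": fixed random projection of the pixel difference.
        proj = np.random.default_rng(0).standard_normal((diff.size, LATENT_DIM))
        return diff @ proj

class AutoregressiveDynamics:
    """Stage 2: predict the next latent code from the history of codes."""
    def predict(self, code_history):
        # Toy stand-in for the transformer: weighted average over past codes.
        weights = np.array([0.5 ** i for i in range(len(code_history))][::-1])
        weights /= weights.sum()
        return np.average(np.stack(code_history), axis=0, weights=weights)

class VideoDecoder:
    """Stage 3: decode a latent code (plus the current frame) into the next frame."""
    def decode(self, frame, code):
        proj = np.random.default_rng(1).standard_normal(
            (LATENT_DIM, int(np.prod(FRAME_SHAPE))))
        return frame + (code @ proj).reshape(FRAME_SHAPE) * 1e-3

def rollout(initial_frame, seed_codes, steps=4):
    """Autoregressive rollout: predicted codes are decoded into a frame sequence."""
    frames, codes = [initial_frame], list(seed_codes)
    dyn, dec = AutoregressiveDynamics(), VideoDecoder()
    for _ in range(steps):
        code = dyn.predict(codes)
        codes.append(code)
        frames.append(dec.decode(frames[-1], code))
    return frames

# Seed the rollout with one code encoded from an observed frame pair.
f0, f1 = np.zeros(FRAME_SHAPE), np.full(FRAME_SHAPE, 0.1)
seed = DynamicsEncoder().encode(f0, f1)
frames = rollout(f1, [seed], steps=4)
print(len(frames))  # initial frame + 4 predicted frames
```

The key design point the sketch mirrors is that the autoregressive model never touches pixels: it operates purely on the compact latent codes, and only the decoder maps codes back to appearance.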
## Citation
```bibtex
@misc{ren2026videoworld2,
      title={VideoWorld 2: Learning Transferable Knowledge from Real-world Videos},
      author={Zhongwei Ren and Yunchao Wei and Xiao Yu and Guixun Luo and Yao Zhao and Bingyi Kang and Jiashi Feng and Xiaojie Jin},
      year={2026},
      eprint={2602.10102},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.10102},
}
```