---
license: other
license_name: versecrafter-license
license_link: LICENSE
tags:
- video-generation
- image-to-video
- diffusion
- 4d-control
- camera-control
- object-motion
- world-model
language:
- en
base_model:
- Wan-AI/Wan2.1-T2V-14B
pipeline_tag: image-to-video
---

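Sixiao Zheng<sup>1,2</sup>, Minghao Yin<sup>3</sup>, Wenbo Hu<sup>4†</sup>, Xiaoyu Li<sup>4</sup>, Ying Shan<sup>4</sup>, Yanwei Fu<sup>1,2†</sup>

<sup>1</sup>Fudan University <sup>2</sup>Shanghai Innovation Institute <sup>3</sup>HKU <sup>4</sup>ARC Lab, Tencent PCG

<sup>†</sup>Corresponding authors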

✨ A controllable video world model with explicit 4D geometric control over camera and multi-object motion.

## TL;DR

- **Dynamic Realistic Video World Model**: VerseCrafter learns a realistic and controllable video world prior from large-scale in-the-wild data, handling challenging dynamic scenes with strong spatial-temporal coherence.
- **4D Geometric Control**: A unified 4D control state provides direct, interpretable control over camera motion, multi-object motion, and their joint coordination, improving geometric faithfulness.
- **Frozen Video Prior + GeoAdapter**: We attach a geometry-aware GeoAdapter to a frozen Wan2.1 backbone, injecting 4D controls into diffusion blocks for precise control without sacrificing video quality.
- **VerseControl4D Dataset**: We introduce a large-scale real-world dataset with automatically rendered camera trajectories and multi-object 3D Gaussian trajectories to supervise 4D controllable generation.

## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | [Wan2.1-T2V-14B](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) |
| **Resolution** | 720 × 1280 |
| **Frames** | 81 frames |
| **Control Signals** | Camera trajectory + 3D Gaussian object trajectories |
| **Architecture** | Frozen DiT + Trainable GeoAdapter |

## Usage

For installation, inference, and the complete pipeline (depth estimation, segmentation, 3D Gaussian fitting, trajectory customization in Blender, and video generation), please refer to our [GitHub repository](https://github.com/TencentARC/VerseCrafter).

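As a rough illustration of the control signals listed under Model Details (a camera trajectory plus per-object 3D Gaussian trajectories over 81 frames), the sketch below shows one way such a 4D control state could be represented in plain Python. It is not the format used by the VerseCrafter code: the names `ControlState4D`, `CameraPose`, `Gaussian3D`, and `linear_trajectory`, and the choice of a rotation matrix plus translation for each camera pose, are assumptions made only for this example. See the GitHub repository for the actual data layout.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

NUM_FRAMES = 81  # frame count listed in Model Details

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]

IDENTITY: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))


@dataclass
class CameraPose:
    """Camera pose at one frame (hypothetical layout: rotation + translation)."""
    rotation: Mat3
    translation: Vec3


@dataclass
class Gaussian3D:
    """One object's 3D Gaussian at one frame (hypothetical layout)."""
    mean: Vec3        # object centre
    covariance: Mat3  # 3x3 matrix describing extent and orientation


@dataclass
class ControlState4D:
    """Illustrative container: one camera path and one Gaussian path per object."""
    camera: List[CameraPose]
    objects: List[List[Gaussian3D]] = field(default_factory=list)

    def validate(self) -> None:
        # Every trajectory must cover exactly NUM_FRAMES frames.
        if len(self.camera) != NUM_FRAMES:
            raise ValueError(f"camera has {len(self.camera)} poses, expected {NUM_FRAMES}")
        for i, trajectory in enumerate(self.objects):
            if len(trajectory) != NUM_FRAMES:
                raise ValueError(f"object {i} has {len(trajectory)} frames, expected {NUM_FRAMES}")


def linear_trajectory(start: Vec3, end: Vec3, covariance: Mat3) -> List[Gaussian3D]:
    """Move a Gaussian's mean linearly from start to end; the covariance stays fixed."""
    trajectory = []
    for frame in range(NUM_FRAMES):
        t = frame / (NUM_FRAMES - 1)
        mean = tuple(s + (e - s) * t for s, e in zip(start, end))
        trajectory.append(Gaussian3D(mean=mean, covariance=covariance))
    return trajectory


# Example: a static camera and one object sliding one unit along the x axis.
state = ControlState4D(
    camera=[CameraPose(IDENTITY, (0.0, 0.0, 0.0)) for _ in range(NUM_FRAMES)],
    objects=[linear_trajectory((0.0, 0.0, 2.0), (1.0, 0.0, 2.0), IDENTITY)],
)
state.validate()
```

Keeping the camera path and each object path as separate per-frame sequences mirrors the idea that camera motion and object motion can be edited independently and then combined in a single control state.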
## Citation

If you find this work useful, please consider citing:

```bibtex
@article{zheng2026versecrafter,
  title={VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control},
  author={Zheng, Sixiao and Yin, Minghao and Hu, Wenbo and Li, Xiaoyu and Shan, Ying and Fu, Yanwei},
  journal={arXiv preprint arXiv:2601.05138},
  year={2026}
}
```

## Acknowledgements

Our work is built upon [MoGe](https://github.com/microsoft/MoGe), [Grounded-SAM-2](https://github.com/IDEA-Research/Grounded-SAM-2), [VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun), [Wan2.1](https://github.com/Wan-Video/Wan2.1) and [diffusers](https://github.com/huggingface/diffusers).

## License

This project is released under the [VerseCrafter License](LICENSE). It is intended for **academic/research purposes only** and commercial use is not permitted.