---
license: other
license_name: versecrafter-license
license_link: LICENSE
tags:
- video-generation
- image-to-video
- diffusion
- 4d-control
- camera-control
- object-motion
- world-model
language:
- en
base_model:
- Wan-AI/Wan2.1-T2V-14B
pipeline_tag: image-to-video
---


# VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control


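Sixiao Zheng<sup>1,2</sup> · Minghao Yin<sup>3</sup> · Wenbo Hu<sup>4†</sup> · Xiaoyu Li<sup>4</sup> · Ying Shan<sup>4</sup> · Yanwei Fu<sup>1,2†</sup>

<sup>1</sup>Fudan University · <sup>2</sup>Shanghai Innovation Institute · <sup>3</sup>HKU · <sup>4</sup>ARC Lab, Tencent PCG

<sup>†</sup>Corresponding authors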

✨ A controllable video world model with explicit 4D geometric control over camera and multi-object motion.

## TL;DR

- **Dynamic Realistic Video World Model**: VerseCrafter learns a realistic and controllable video world prior from large-scale in-the-wild data, handling challenging dynamic scenes with strong spatial-temporal coherence.
- **4D Geometric Control**: A unified 4D control state provides direct, interpretable control over camera motion, multi-object motion, and their joint coordination, improving geometric faithfulness.
- **Frozen Video Prior + GeoAdapter**: We attach a geometry-aware GeoAdapter to a frozen Wan2.1 backbone, injecting 4D controls into diffusion blocks for precise control without sacrificing video quality.
- **VerseControl4D Dataset**: We introduce a large-scale real-world dataset with automatically rendered camera trajectories and multi-object 3D Gaussian trajectories to supervise 4D controllable generation.

## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | [Wan2.1-T2V-14B](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) |
| **Resolution** | 720 × 1280 |
| **Frames** | 81 frames |
| **Control Signals** | Camera trajectory + 3D Gaussian object trajectories |
| **Architecture** | Frozen DiT + Trainable GeoAdapter |

## Usage

For installation, inference, and the complete pipeline (depth estimation, segmentation, 3D Gaussian fitting, trajectory customization in Blender, and video generation), please refer to our [GitHub repository](https://github.com/TencentARC/VerseCrafter).
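
To give a rough sense of scale for the resolution and frame count listed under Model Details, the sketch below estimates the latent grid and token count that the DiT would process for one clip. It is not taken from the VerseCrafter code. It assumes the Wan2.1 VAE's 4× temporal and 8× spatial compression and a `(1, 2, 2)` DiT patch size, none of which is stated in this card, and the helper names `latent_grid` and `token_grid` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grid:
    """A (frames, height, width) grid, in latent cells or in tokens."""
    frames: int
    height: int
    width: int

    @property
    def size(self) -> int:
        return self.frames * self.height * self.width


def latent_grid(num_frames: int, height: int, width: int,
                t_stride: int = 4, s_stride: int = 8) -> Grid:
    """Latent grid for a clip, assuming a causal video VAE.

    ASSUMPTION: strides of 4 (time) and 8 (space) as in Wan2.1's VAE;
    the first frame is encoded on its own, hence 1 + (T - 1) / t_stride.
    """
    if (num_frames - 1) % t_stride != 0:
        raise ValueError("num_frames must have the form t_stride * k + 1")
    if height % s_stride or width % s_stride:
        raise ValueError("height and width must be multiples of s_stride")
    return Grid(1 + (num_frames - 1) // t_stride,
                height // s_stride,
                width // s_stride)


def token_grid(latent: Grid, patch: tuple[int, int, int] = (1, 2, 2)) -> Grid:
    """Token grid after patchifying the latent.

    ASSUMPTION: a (1, 2, 2) patch size, as in Wan2.1's DiT.
    """
    pt, ph, pw = patch
    if latent.frames % pt or latent.height % ph or latent.width % pw:
        raise ValueError("latent grid must be divisible by the patch size")
    return Grid(latent.frames // pt, latent.height // ph, latent.width // pw)


if __name__ == "__main__":
    latent = latent_grid(num_frames=81, height=720, width=1280)
    tokens = token_grid(latent)
    print("latent grid:", latent)
    print("token grid: ", tokens, "->", tokens.size, "tokens")
```

Under these assumptions, an 81-frame 720 × 1280 clip maps to a 21 × 90 × 160 latent and 75,600 tokens per sequence, which is why inference at this setting is memory-intensive. Actual requirements depend on the released code in the GitHub repository.
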
## Citation

If you find this work useful, please consider citing:

```bibtex
@article{zheng2026versecrafter,
  title={VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control},
  author={Zheng, Sixiao and Yin, Minghao and Hu, Wenbo and Li, Xiaoyu and Shan, Ying and Fu, Yanwei},
  journal={arXiv preprint arXiv:2601.05138},
  year={2026}
}
```

## Acknowledgements

Our work is built upon [MoGe](https://github.com/microsoft/MoGe), [Grounded-SAM-2](https://github.com/IDEA-Research/Grounded-SAM-2), [VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun), [Wan2.1](https://github.com/Wan-Video/Wan2.1) and [diffusers](https://github.com/huggingface/diffusers).

## License

This project is released under the [VerseCrafter License](LICENSE). It is intended for **academic/research purposes only**, and commercial use is not permitted.