VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control

Sixiao Zheng1,2    Minghao Yin3    Wenbo Hu4†    Xiaoyu Li4    Ying Shan4    Yanwei Fu1,2†

1Fudan University    2Shanghai Innovation Institute    3HKU    4ARC Lab, Tencent PCG

† Corresponding authors

✨ A controllable video world model with explicit 4D geometric control over camera and multi-object motion.

TL;DR

  • Dynamic Realistic Video World Model: VerseCrafter learns a realistic and controllable video world prior from large-scale in-the-wild data, handling challenging dynamic scenes with strong spatial-temporal coherence.
  • 4D Geometric Control: A unified 4D control state provides direct, interpretable control over camera motion, multi-object motion, and their joint coordination, improving geometric faithfulness.
  • Frozen Video Prior + GeoAdapter: We attach a geometry-aware GeoAdapter to a frozen Wan2.1 backbone, injecting 4D controls into diffusion blocks for precise control without sacrificing video quality.
  • VerseControl4D Dataset: We introduce a large-scale real-world dataset with automatically rendered camera trajectories and multi-object 3D Gaussian trajectories to supervise 4D controllable generation.
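As a rough illustration of what a unified 4D control state could contain, the sketch below pairs a per-frame camera trajectory with per-object 3D Gaussian trajectories. All class, field, and function names here (`CameraPose`, `Gaussian3D`, `ControlState4D`, `linear_trajectory`, `build_demo_state`) are hypothetical and are not taken from the VerseCrafter codebase; the pose convention and the constant-velocity motion are likewise assumptions made only for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]

NUM_FRAMES = 81  # frame count listed in the model card
IDENTITY: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))


@dataclass
class CameraPose:
    """Camera extrinsics for one frame (convention is an assumption)."""
    rotation: Mat3
    translation: Vec3


@dataclass
class Gaussian3D:
    """One object at one frame: position as mean, extent as covariance."""
    mean: Vec3
    covariance: Mat3


@dataclass
class ControlState4D:
    """Hypothetical container: one camera pose per frame plus one
    Gaussian per object per frame, indexed as objects[i][t]."""
    camera: List[CameraPose]
    objects: List[List[Gaussian3D]]

    def validate(self) -> None:
        # Every object trajectory must cover the same frames as the camera.
        for i, trajectory in enumerate(self.objects):
            if len(trajectory) != len(self.camera):
                raise ValueError(
                    f"object {i} has {len(trajectory)} frames, "
                    f"camera has {len(self.camera)}"
                )


def linear_trajectory(start: Gaussian3D, velocity: Vec3,
                      num_frames: int) -> List[Gaussian3D]:
    """Translate a Gaussian by a constant per-frame offset, shape unchanged."""
    return [
        Gaussian3D(
            mean=(start.mean[0] + t * velocity[0],
                  start.mean[1] + t * velocity[1],
                  start.mean[2] + t * velocity[2]),
            covariance=start.covariance,
        )
        for t in range(num_frames)
    ]


def build_demo_state(num_frames: int = NUM_FRAMES) -> ControlState4D:
    """Camera moves along z while one object drifts along x."""
    camera = [CameraPose(IDENTITY, (0.0, 0.0, 0.05 * t))
              for t in range(num_frames)]
    obj = Gaussian3D(mean=(1.0, 0.0, 5.0), covariance=IDENTITY)
    state = ControlState4D(
        camera=camera,
        objects=[linear_trajectory(obj, (0.01, 0.0, 0.0), num_frames)],
    )
    state.validate()
    return state
```

Keeping camera and object motion in one structure with a shared frame index is what makes joint coordination easy to express: editing either trajectory never desynchronizes the other, and `validate` catches length mismatches early.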

Model Details

| Property | Value |
| --- | --- |
| Base Model | Wan2.1-T2V-14B |
| Resolution | 720 × 1280 |
| Frames | 81 |
| Control Signals | Camera trajectory + 3D Gaussian object trajectories |
| Architecture | Frozen DiT + Trainable GeoAdapter |

Usage

For installation, inference, and the complete pipeline (depth estimation, segmentation, 3D Gaussian fitting, trajectory customization in Blender, and video generation), please refer to our GitHub repository.
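To give a sense of what the 3D Gaussian fitting step involves, the sketch below fits a single Gaussian to an object's 3D points by moment matching (sample mean and sample covariance). This is one common way to summarize a point set and is an assumption for illustration only; the repository's actual fitting procedure may differ. The function name `fit_gaussian_3d` is hypothetical, and the input is assumed to be the object's points already lifted to 3D, for example from a depth map and a segmentation mask.

```python
from typing import List, Sequence, Tuple


def fit_gaussian_3d(
    points: Sequence[Sequence[float]],
) -> Tuple[List[float], List[List[float]]]:
    """Return (mean, covariance) of a set of 3D points.

    Uses the unbiased sample covariance (divides by n - 1).
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to estimate a covariance")

    # Sample mean along each axis.
    mean = [sum(p[k] for p in points) / n for k in range(3)]

    # 3 x 3 sample covariance matrix.
    cov = [
        [
            sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in points) / (n - 1)
            for b in range(3)
        ]
        for a in range(3)
    ]
    return mean, cov


# Example input: the eight corners of a cube centered at the origin.
cube = [(x, y, z) for x in (-1.0, 1.0) for y in (-1.0, 1.0) for z in (-1.0, 1.0)]
mean, cov = fit_gaussian_3d(cube)
```

For the symmetric cube, the mean is the origin and the covariance is diagonal, so the fitted Gaussian is axis-aligned with equal extent in every direction. Elongated objects would instead produce unequal diagonal entries or nonzero off-diagonal terms.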

Citation

If you find this work useful, please consider citing:

@article{zheng2026versecrafter,
  title={VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control},
  author={Zheng, Sixiao and Yin, Minghao and Hu, Wenbo and Li, Xiaoyu and Shan, Ying and Fu, Yanwei},
  journal={arXiv preprint arXiv:2601.05138},
  year={2026}
}

Acknowledgements

Our work is built upon MoGe, Grounded-SAM-2, VideoX-Fun, Wan2.1 and diffusers.

License

This project is released under the VerseCrafter License. It is intended for academic and research purposes only; commercial use is not permitted.
