---
license: other
license_name: versecrafter-license
license_link: LICENSE
tags:
- video-generation
- image-to-video
- diffusion
- 4d-control
- camera-control
- object-motion
- world-model
language:
- en
base_model:
- Wan-AI/Wan2.1-T2V-14B
pipeline_tag: image-to-video
extra_gated_eu_disallowed: true
---
<p align="center">
<img src="assets/versecrafter.png" alt="VerseCrafter Logo" width="300">
</p>
<h2 align="center">
VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
</h2>
<a href="https://arxiv.org/pdf/2601.05138"><img src='https://img.shields.io/badge/arXiv-Paper-red?style=flat&logo=arXiv&logoColor=red' alt='arxiv'></a>
<a href="https://github.com/TencentARC/VerseCrafter"><img src='https://img.shields.io/badge/GitHub-Code-blue?style=flat&logo=GitHub' alt='github'></a>
<a href="https://huggingface.co/TencentARC/VerseCrafter"><img src='https://img.shields.io/badge/Hugging%20Face-ckpts-orange?style=flat&logo=HuggingFace&logoColor=orange' alt='huggingface'></a>
<a href="https://sixiaozheng.github.io/VerseCrafter_page/"><img src='https://img.shields.io/badge/Project-Page-Green' alt='project page'></a>
<p align="center">
<a href="https://sixiaozheng.github.io/">Sixiao Zheng</a><sup>1,2</sup>
<a href="#">Minghao Yin</a><sup>3</sup>
<a href="https://wbhu.github.io/">Wenbo Hu</a><sup>4†</sup>
<a href="https://xiaoyu258.github.io/">Xiaoyu Li</a><sup>4</sup>
<a href="https://www.linkedin.com/in/YingShanProfile">Ying Shan</a><sup>4</sup>
<a href="https://yanweifu.github.io/">Yanwei Fu</a><sup>1,2†</sup>
</p>
<p align="center">
<sup>1</sup>Fudan University <sup>2</sup>Shanghai Innovation Institute <sup>3</sup>HKU <sup>4</sup>ARC Lab, Tencent PCG
</p>
<p align="center">
<sup>†</sup>Corresponding authors
</p>
✨ A controllable video world model with explicit 4D geometric control over camera and multi-object motion.
## TL;DR
- **Dynamic Realistic Video World Model**: VerseCrafter learns a realistic and controllable video world prior from large-scale in-the-wild data, handling challenging dynamic scenes with strong spatial-temporal coherence.
- **4D Geometric Control**: A unified 4D control state provides direct, interpretable control over camera motion, multi-object motion, and their joint coordination, improving geometric faithfulness.
- **Frozen Video Prior + GeoAdapter**: We attach a geometry-aware GeoAdapter to a frozen Wan2.1 backbone, injecting 4D controls into diffusion blocks for precise control without sacrificing video quality.
- **VerseControl4D Dataset**: We introduce a large-scale real-world dataset with automatically rendered camera trajectories and multi-object 3D Gaussian trajectories to supervise 4D controllable generation.
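
To make the idea of a unified 4D control state more concrete, here is a minimal sketch of how per-frame camera poses and per-object 3D Gaussian trajectories might be held together in one structure. All names (`Gaussian3D`, `ControlState4D`, `linear_track`) and the choice of 4x4 camera matrices are illustrative assumptions, not the data format used by the VerseCrafter code; see the GitHub repository for the actual pipeline.

```python
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]


@dataclass
class Gaussian3D:
    """Hypothetical per-frame object state: a 3D Gaussian in world space."""
    mean: List[float]    # object centre (x, y, z)
    covariance: Matrix   # 3x3 covariance encoding extent and orientation


@dataclass
class ControlState4D:
    """Hypothetical container: one camera pose per frame plus object tracks."""
    camera_poses: List[Matrix]             # camera_poses[t] is a 4x4 matrix
    object_tracks: List[List[Gaussian3D]]  # object_tracks[k][t]

    @property
    def num_frames(self) -> int:
        return len(self.camera_poses)

    def validate(self) -> None:
        # Every object must have exactly one Gaussian per camera frame.
        for k, track in enumerate(self.object_tracks):
            if len(track) != self.num_frames:
                raise ValueError(f"object {k} has {len(track)} frames, "
                                 f"expected {self.num_frames}")


def identity(n: int) -> Matrix:
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def linear_track(start, end, num_frames: int, radius: float = 0.5):
    """Move an isotropic Gaussian from `start` to `end` at constant speed."""
    track = []
    for t in range(num_frames):
        a = t / (num_frames - 1) if num_frames > 1 else 0.0
        mean = [s + a * (e - s) for s, e in zip(start, end)]
        cov = [[radius ** 2 if i == j else 0.0 for j in range(3)]
               for i in range(3)]
        track.append(Gaussian3D(mean=mean, covariance=cov))
    return track


# Example: static camera, one object sliding 1 unit along x over 81 frames.
state = ControlState4D(
    camera_poses=[identity(4) for _ in range(81)],
    object_tracks=[linear_track([0, 0, 2], [1, 0, 2], 81)],
)
state.validate()
```

Keeping camera and object motion in a single structure indexed by the same frame counter is what makes joint edits (for example, moving the camera while an object follows its own path) straightforward to express.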
## Model Details
| Property | Value |
|----------|-------|
| **Base Model** | [Wan2.1-T2V-14B](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) |
| **Resolution** | 720 × 1280 |
| **Frames** | 81 frames |
| **Control Signals** | Camera trajectory + 3D Gaussian object trajectories |
| **Architecture** | Frozen DiT + Trainable GeoAdapter |
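
As a rough guide to the sizes involved, the sketch below computes the latent grid a video diffusion backbone would operate on for the resolution and frame count in the table. The compression factors (4x temporal, 8x spatial) are assumptions based on how the Wan2.1 VAE is commonly described, and reading "720 × 1280" as height × width is also an assumption; neither is confirmed by this card.

```python
def latent_grid(frames: int, height: int, width: int,
                t_stride: int = 4, s_stride: int = 8):
    """Return (latent_frames, latent_height, latent_width).

    t_stride and s_stride are ASSUMED VAE compression factors, not values
    stated in this model card. The first frame is assumed to be encoded on
    its own, so the frame count must be of the form t_stride * k + 1.
    """
    if (frames - 1) % t_stride != 0:
        raise ValueError("frames must equal t_stride * k + 1")
    if height % s_stride != 0 or width % s_stride != 0:
        raise ValueError("height and width must be multiples of s_stride")
    return ((frames - 1) // t_stride + 1,
            height // s_stride,
            width // s_stride)


# 81 frames at 720 x 1280 under the assumed strides.
grid = latent_grid(frames=81, height=720, width=1280)
print(grid)  # (21, 90, 160)
```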
## Usage
For installation, inference, and the complete pipeline (depth estimation, segmentation, 3D Gaussian fitting, trajectory customization in Blender, and video generation), please refer to our [GitHub repository](https://github.com/TencentARC/VerseCrafter).
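
If you only need the checkpoint files, a minimal sketch using `huggingface_hub.snapshot_download` is shown below. The target directory `./VerseCrafter` is a hypothetical choice, and because this repository carries gating metadata you may first need to accept the terms on the model page and authenticate (for example with `huggingface-cli login`). The GitHub repository remains the authoritative source for where the inference scripts expect the weights.

```python
REPO_ID = "TencentARC/VerseCrafter"  # repository linked from this model card


def download_checkpoints(local_dir: str = "./VerseCrafter") -> str:
    """Download the model repository and return the local path.

    `local_dir` is a hypothetical default; adjust it to wherever the
    inference scripts expect the checkpoints.
    """
    # Imported lazily so this module can be loaded without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


if __name__ == "__main__":
    print(download_checkpoints())
```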
## Citation
If you find this work useful, please consider citing:
```bibtex
@article{zheng2026versecrafter,
title={VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control},
author={Zheng, Sixiao and Yin, Minghao and Hu, Wenbo and Li, Xiaoyu and Shan, Ying and Fu, Yanwei},
journal={arXiv preprint arXiv:2601.05138},
year={2026}
}
```
## Acknowledgements
Our work is built upon [MoGe](https://github.com/microsoft/MoGe), [Grounded-SAM-2](https://github.com/IDEA-Research/Grounded-SAM-2), [VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun), [Wan2.1](https://github.com/Wan-Video/Wan2.1) and [diffusers](https://github.com/huggingface/diffusers).
## License
This project is released under the [VerseCrafter License](LICENSE). It is intended for **academic/research purposes only**; commercial use is not permitted.