---
license: other
license_name: versecrafter-license
license_link: LICENSE
tags:
  - video-generation
  - image-to-video
  - diffusion
  - 4d-control
  - camera-control
  - object-motion
  - world-model
language:
  - en
base_model:
  - Wan-AI/Wan2.1-T2V-14B
pipeline_tag: image-to-video
extra_gated_eu_disallowed: true
---
<p align="center">
  <img src="assets/versecrafter.png" alt="VerseCrafter Logo" width="300">
</p>

<h2 align="center"> 
    VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control
</h2>

<a href="https://arxiv.org/pdf/2601.05138"><img src='https://img.shields.io/badge/arXiv-Paper-red?style=flat&logo=arXiv&logoColor=red' alt='arxiv'></a>&nbsp;
<a href="https://github.com/TencentARC/VerseCrafter"><img src='https://img.shields.io/badge/GitHub-Code-blue?style=flat&logo=GitHub' alt='github'></a>&nbsp;
<a href="https://huggingface.co/TencentARC/VerseCrafter"><img src='https://img.shields.io/badge/Hugging Face-ckpts-orange?style=flat&logo=HuggingFace&logoColor=orange' alt='huggingface'></a>&nbsp;
<a href="https://sixiaozheng.github.io/VerseCrafter_page/"><img src='https://img.shields.io/badge/Project-Page-Green' alt='project page'></a>&nbsp;

<p align="center">
  <a href="https://sixiaozheng.github.io/">Sixiao Zheng</a><sup>1,2</sup> &nbsp;&nbsp;
  <a href="#">Minghao Yin</a><sup>3</sup> &nbsp;&nbsp;
  <a href="https://wbhu.github.io/">Wenbo Hu</a><sup>4†</sup> &nbsp;&nbsp;
  <a href="https://xiaoyu258.github.io/">Xiaoyu Li</a><sup>4</sup> &nbsp;&nbsp;
  <a href="https://www.linkedin.com/in/YingShanProfile">Ying Shan</a><sup>4</sup> &nbsp;&nbsp;
  <a href="https://yanweifu.github.io/">Yanwei Fu</a><sup>1,2†</sup>
</p>

<p align="center">
  <sup>1</sup>Fudan University &nbsp;&nbsp; <sup>2</sup>Shanghai Innovation Institute &nbsp;&nbsp; <sup>3</sup>HKU &nbsp;&nbsp; <sup>4</sup>ARC Lab, Tencent PCG
</p>

<p align="center">
  <sup>†</sup>Corresponding authors
</p>

✨ A controllable video world model with explicit 4D geometric control over camera and multi-object motion.

## TL;DR

- **Dynamic Realistic Video World Model**: VerseCrafter learns a realistic and controllable video world prior from large-scale in-the-wild data, handling challenging dynamic scenes with strong spatial-temporal coherence.
- **4D Geometric Control**: A unified 4D control state provides direct, interpretable control over camera motion, multi-object motion, and their joint coordination, improving geometric faithfulness.
- **Frozen Video Prior + GeoAdapter**: We attach a geometry-aware GeoAdapter to a frozen Wan2.1 backbone, injecting 4D controls into diffusion blocks for precise control without sacrificing video quality.
- **VerseControl4D Dataset**: We introduce a large-scale real-world dataset with automatically rendered camera trajectories and multi-object 3D Gaussian trajectories to supervise 4D controllable generation.


## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | [Wan2.1-T2V-14B](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) |
| **Resolution** | 720 × 1280 |
| **Frames** | 81 frames |
| **Control Signals** | Camera trajectory + 3D Gaussian object trajectories |
| **Architecture** | Frozen DiT + Trainable GeoAdapter |
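
To make the control signals in the table concrete, here is a minimal sketch of how a per-frame camera trajectory and per-object 3D Gaussian trajectories could be grouped into one control state. It is an illustration only: `ControlState4D`, `GaussianTrajectory`, `linear_object`, the 4 × 4 pose layout, and the nested-list representation are all hypothetical and are not the data format used by the VerseCrafter code. Only the frame count of 81 comes from the table.

```python
from dataclasses import dataclass, field
from typing import List

NUM_FRAMES = 81  # frame count from the Model Details table

Vector = List[float]
Matrix = List[List[float]]


def identity(n: int) -> Matrix:
    """Return an n x n identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


@dataclass
class GaussianTrajectory:
    """Hypothetical container: one object's 3D Gaussian (mean, covariance) per frame."""
    label: str
    means: List[Vector]        # NUM_FRAMES entries, each of length 3
    covariances: List[Matrix]  # NUM_FRAMES entries, each 3 x 3


@dataclass
class ControlState4D:
    """Hypothetical container pairing a camera path with object trajectories."""
    camera_poses: List[Matrix]  # NUM_FRAMES entries, each 4 x 4 (pose convention assumed)
    objects: List[GaussianTrajectory] = field(default_factory=list)

    def validate(self) -> None:
        """Check that every signal covers the same number of frames."""
        if len(self.camera_poses) != NUM_FRAMES:
            raise ValueError("camera trajectory must have one pose per frame")
        for obj in self.objects:
            if len(obj.means) != NUM_FRAMES or len(obj.covariances) != NUM_FRAMES:
                raise ValueError(f"object {obj.label!r} must have one Gaussian per frame")


def linear_object(label: str, start: Vector, end: Vector) -> GaussianTrajectory:
    """Move a unit-covariance Gaussian on a straight line from start to end."""
    means = []
    for t in range(NUM_FRAMES):
        a = t / (NUM_FRAMES - 1)  # interpolation weight in [0, 1]
        means.append([(1 - a) * s + a * e for s, e in zip(start, end)])
    return GaussianTrajectory(label, means, [identity(3) for _ in range(NUM_FRAMES)])


# Example: a static camera and one object sliding 2 units along x.
state = ControlState4D(
    camera_poses=[identity(4) for _ in range(NUM_FRAMES)],
    objects=[linear_object("car", [0.0, 0.0, 5.0], [2.0, 0.0, 5.0])],
)
state.validate()
```

Keeping camera and object motion in one structure with a shared frame axis mirrors the idea of a unified 4D control state; the actual input files and conventions are documented in the GitHub repository.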

## Usage

For installation, inference, and the complete pipeline (depth estimation, segmentation, 3D Gaussian fitting, trajectory customization in Blender, and video generation), please refer to our [GitHub repository](https://github.com/TencentARC/VerseCrafter).
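
As a starting point before following the repository instructions, the checkpoints can probably be fetched with `huggingface_hub`. The sketch below assumes the weights live in the `TencentARC/VerseCrafter` repository linked in the Hugging Face badge above, that access is gated (so a token may be required), and that `checkpoints/VerseCrafter` is an acceptable local directory. The directory name and the `HF_TOKEN` environment-variable fallback are illustrative choices, not something this model card specifies.

```python
import os

REPO_ID = "TencentARC/VerseCrafter"  # taken from the Hugging Face badge link


def download_kwargs(local_dir="checkpoints/VerseCrafter", token=None):
    """Build keyword arguments for huggingface_hub.snapshot_download.

    local_dir is a hypothetical target path; adjust it to the layout
    expected by the inference scripts in the GitHub repository.
    """
    kwargs = {"repo_id": REPO_ID, "local_dir": local_dir}
    token = token or os.environ.get("HF_TOKEN")  # assumed token source
    if token:
        kwargs["token"] = token
    return kwargs


def download_checkpoints(**overrides):
    """Download the full repository snapshot and return its local path."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download  # pip install huggingface_hub

    return snapshot_download(**download_kwargs(**overrides))


# Typical call (requires network access and, if gated, an accepted license):
# path = download_checkpoints()
```

Running `huggingface-cli login` beforehand is an alternative to passing a token explicitly.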

## Citation

If you find this work useful, please consider citing:

```bibtex
@article{zheng2026versecrafter,
  title={VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control},
  author={Zheng, Sixiao and Yin, Minghao and Hu, Wenbo and Li, Xiaoyu and Shan, Ying and Fu, Yanwei},
  journal={arXiv preprint arXiv:2601.05138},
  year={2026}
}
```

## Acknowledgements

Our work is built upon [MoGe](https://github.com/microsoft/MoGe), [Grounded-SAM-2](https://github.com/IDEA-Research/Grounded-SAM-2), [VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun), [Wan2.1](https://github.com/Wan-Video/Wan2.1) and [diffusers](https://github.com/huggingface/diffusers).

## License

This project is released under the [VerseCrafter License](LICENSE). It is intended for **academic and research purposes only**; commercial use is not permitted.