---
language: [en]
license: other
library_name: motioncrafter
tags:
- motion
- video
- 4d
- diffusion
- scene-flow
pipeline_tag: image-to-3d
base_model: stabilityai/stable-video-diffusion-img2vid-xt
---
<h1 align="center" style="font-size: 1.6em;">MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE</h1>
<div align="center">
[Ruijie Zhu](https://ruijiezhu94.github.io/ruijiezhu/)<sup>1,2</sup>,
[Jiahao Lu](https://scholar.google.com/citations?user=cRpteW4AAAAJ&hl=en)<sup>3</sup>,
[Wenbo Hu](https://wbhu.github.io/)<sup>2</sup>,
[Xiaoguang Han](https://scholar.google.com/citations?user=z-rqsR4AAAAJ&hl=en)<sup>4</sup><br>
[Jianfei Cai](https://jianfei-cai.github.io/)<sup>5</sup>,
[Ying Shan](https://scholar.google.com/citations?user=4oXBp9UAAAAJ&hl=en)<sup>2</sup>,
[Chuanxia Zheng](https://physicalvision.github.io/people/~chuanxia)<sup>1</sup>
<sup>1</sup> NTU <sup>2</sup> ARC Lab, Tencent PCG <sup>3</sup> HKUST <sup>4</sup> CUHK(SZ) <sup>5</sup> Monash University
[📄 Paper](https://arxiv.org/abs/2602.08961) | [🌐 Project Page](https://ruijiezhu94.github.io/MotionCrafter_Page/) | [💻 Code](https://github.com/TencentARC/MotionCrafter) | [📜 License](LICENSE.txt)
</div>
## Model Description
MotionCrafter is a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense object motion from monocular videos. It predicts dense point maps and scene flow for each frame within a shared world coordinate system, without requiring post-optimization.
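As a rough illustration (not the model's actual inference code), dense scene flow in a shared world coordinate system can be viewed as the per-pixel 3D displacement between consecutive world-space point maps. A minimal NumPy sketch, with hypothetical names:

```python
import numpy as np

def scene_flow_from_pointmaps(points_t: np.ndarray, points_t1: np.ndarray) -> np.ndarray:
    """Per-pixel 3D displacement between two world-space point maps.

    points_t, points_t1: (H, W, 3) point maps for frames t and t+1,
    both expressed in the same world coordinate system.
    Returns an (H, W, 3) scene-flow field.
    """
    assert points_t.shape == points_t1.shape and points_t.shape[-1] == 3
    return points_t1 - points_t
```

Static regions yield zero flow, while dynamic pixels carry their 3D motion; MotionCrafter predicts both the point maps and the flow directly rather than differencing them.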
## Intended Use
- Research on 4D reconstruction and motion estimation from monocular videos
- Academic evaluation and benchmarking of dense point map and scene flow prediction
Not intended for safety-critical or real-time production use.
## Limitations
- Performance can degrade with extreme motion blur or severe occlusion.
- Output quality is sensitive to input resolution and video quality.
- Generalization may be limited for out-of-domain scenes.
## Training Data
Training data and preprocessing details are described in the paper and the main repository; see the project page for dataset specifics.
## Evaluation
Please refer to the paper for evaluation datasets, metrics, and results.
## How to Use
```python
import torch

from motioncrafter import (
    MotionCrafterDiffPipeline,
    MotionCrafterDetermPipeline,
    UnifyAutoencoderKL,
    UNetSpatioTemporalConditionModelVid2vid,
)

unet_path = "TencentARC/MotionCrafter"
vae_path = "TencentARC/MotionCrafter"
model_type = "determ"  # or "diff" for the diffusion variant
cache_dir = "./pretrained_models"

# Load the UNet for the chosen variant in fp16.
unet = UNetSpatioTemporalConditionModelVid2vid.from_pretrained(
    unet_path,
    subfolder="unet_diff" if model_type == "diff" else "unet_determ",
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    cache_dir=cache_dir,
).requires_grad_(False).to("cuda", dtype=torch.float16)

# Load the 4D VAE for the joint geometry and motion representation in fp32.
geometry_motion_vae = UnifyAutoencoderKL.from_pretrained(
    vae_path,
    subfolder="geometry_motion_vae",
    low_cpu_mem_usage=True,
    torch_dtype=torch.float32,
    cache_dir=cache_dir,
).requires_grad_(False).to("cuda", dtype=torch.float32)

# Build the pipeline on top of Stable Video Diffusion weights.
if model_type == "diff":
    pipe = MotionCrafterDiffPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        unet=unet,
        torch_dtype=torch.float16,
        variant="fp16",
        cache_dir=cache_dir,
    ).to("cuda")
else:
    pipe = MotionCrafterDetermPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        unet=unet,
        torch_dtype=torch.float16,
        variant="fp16",
        cache_dir=cache_dir,
    ).to("cuda")
```
## Model Weights
- geometry_motion_vae/: 4D VAE for joint geometry and motion representation
- unet_determ/: deterministic UNet for motion prediction
## Model Variants
- Deterministic (unet_determ): fast inference with fixed predictions per input
- Diffusion (unet_diff): probabilistic predictions with diverse outputs
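The variant names map directly to the checkpoint subfolders used in the loading snippet above. A small helper (hypothetical, not part of the `motioncrafter` package) makes the mapping explicit and rejects typos early:

```python
def unet_subfolder(model_type: str) -> str:
    """Map a MotionCrafter variant name to its UNet checkpoint subfolder."""
    subfolders = {"determ": "unet_determ", "diff": "unet_diff"}
    if model_type not in subfolders:
        raise ValueError(
            f"unknown model_type {model_type!r}; expected 'determ' or 'diff'"
        )
    return subfolders[model_type]
```

This replaces the inline conditional (`"unet_diff" if model_type == "diff" else "unet_determ"`), which silently falls back to the deterministic UNet for any unrecognized string.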
## Citation
```bibtex
@article{zhu2025motioncrafter,
title={MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE},
author={Zhu, Ruijie and Lu, Jiahao and Hu, Wenbo and Han, Xiaoguang and Cai, Jianfei and Shan, Ying and Zheng, Chuanxia},
journal={arXiv preprint arXiv:2602.08961},
year={2026}
}
```
## License
This model is provided under the Tencent License. See [LICENSE.txt](LICENSE.txt) for details.
## Acknowledgments
This work builds upon [GeometryCrafter](https://github.com/TencentARC/GeometryCrafter). We thank the authors for their excellent contributions.