|
|
--- |
|
|
language: [en] |
|
|
license: other |
|
|
library_name: motioncrafter |
|
|
tags: |
|
|
- motion |
|
|
- video |
|
|
- 4d |
|
|
- diffusion |
|
|
- scene-flow |
|
|
pipeline_tag: image-to-3d |
|
|
base_model: stabilityai/stable-video-diffusion-img2vid-xt |
|
|
--- |
|
|
|
|
|
<h1 align="center" style="font-size: 1.6em;">MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE</h1> |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
[Ruijie Zhu](https://ruijiezhu94.github.io/ruijiezhu/)<sup>1,2</sup>, |
|
|
[Jiahao Lu](https://scholar.google.com/citations?user=cRpteW4AAAAJ&hl=en)<sup>3</sup>, |
|
|
[Wenbo Hu](https://wbhu.github.io/)<sup>2</sup>, |
|
|
[Xiaoguang Han](https://scholar.google.com/citations?user=z-rqsR4AAAAJ&hl=en)<sup>4</sup><br> |
|
|
[Jianfei Cai](https://jianfei-cai.github.io/)<sup>5</sup>, |
|
|
[Ying Shan](https://scholar.google.com/citations?user=4oXBp9UAAAAJ&hl=en)<sup>2</sup>, |
|
|
[Chuanxia Zheng](https://physicalvision.github.io/people/~chuanxia)<sup>1</sup> |
|
|
|
|
|
<sup>1</sup> NTU <sup>2</sup> ARC Lab, Tencent PCG <sup>3</sup> HKUST <sup>4</sup> CUHK(SZ) <sup>5</sup> Monash University |
|
|
|
|
|
[📄 Paper](https://arxiv.org/abs/2602.08961) | [🌐 Project Page](https://ruijiezhu94.github.io/MotionCrafter_Page/) | [💻 Code](https://github.com/TencentARC/MotionCrafter) | [📜 License](LICENSE.txt)
|
|
|
|
|
</div> |
|
|
|
|
|
## Model Description |
|
|
|
|
|
MotionCrafter is a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense object motion from monocular videos. It predicts dense point maps and scene flow for each frame within a shared world coordinate system, without requiring post-optimization. |
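Because point maps and scene flow share one world coordinate system, advecting geometry to the next frame reduces to a per-point addition. A minimal NumPy sketch of this relationship (the shapes here are illustrative only, not the model's actual tensor layout):

```python
import numpy as np

# Illustrative per-frame outputs: an H x W point map of world-space XYZ
# coordinates and a matching per-point 3D scene-flow field (t -> t+1).
H, W = 4, 4
points_t = np.random.rand(H, W, 3)            # geometry at frame t
scene_flow_t = np.random.rand(H, W, 3) * 0.1  # dense 3D motion at frame t

# In a shared world frame, forward advection is plain addition.
points_t1 = points_t + scene_flow_t
print(points_t1.shape)  # (4, 4, 3)
```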
|
|
|
|
|
## Intended Use |
|
|
|
|
|
- Research on 4D reconstruction and motion estimation from monocular videos |
|
|
- Academic evaluation and benchmarking of dense point map and scene flow prediction |
|
|
|
|
|
Not intended for safety-critical or real-time production use. |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Performance can degrade with extreme motion blur or severe occlusion. |
|
|
- Output quality is sensitive to input resolution and video quality. |
|
|
- Generalization may be limited for out-of-domain scenes. |
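Since output quality is sensitive to input resolution, it can help to snap frame dimensions to the backbone's expected granularity before inference. A minimal sketch, assuming a divisor of 64 (typical for SVD-based models; check the main repository for the actual constraint):

```python
def snap_to_multiple(width: int, height: int, divisor: int = 64) -> tuple:
    """Round each side down to the nearest multiple of `divisor`,
    never going below `divisor` itself."""
    def snap(x: int) -> int:
        return max(divisor, x - x % divisor)
    return snap(width), snap(height)

# A 1280 x 725 frame would be resized to 1280 x 704 before inference.
print(snap_to_multiple(1280, 725))  # (1280, 704)
```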
|
|
|
|
|
## Training Data |
|
|
|
|
|
Training data and preprocessing details are described in the paper and the main repository; please refer there for dataset specifics.
|
|
|
|
|
## Evaluation |
|
|
|
|
|
Please refer to the paper for evaluation datasets, metrics, and results. |
|
|
|
|
|
## How to Use |
|
|
|
|
|
```python
import torch

from motioncrafter import (
    MotionCrafterDiffPipeline,
    MotionCrafterDetermPipeline,
    UnifyAutoencoderKL,
    UNetSpatioTemporalConditionModelVid2vid,
)

model_path = "TencentARC/MotionCrafter"
model_type = "determ"  # or "diff" for the diffusion variant
cache_dir = "./pretrained_models"

# Load the UNet (fp16) for the chosen variant.
unet = UNetSpatioTemporalConditionModelVid2vid.from_pretrained(
    model_path,
    subfolder="unet_diff" if model_type == "diff" else "unet_determ",
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    cache_dir=cache_dir,
).requires_grad_(False).to("cuda", dtype=torch.float16)

# Load the 4D VAE in fp32 for numerically stable geometry/motion decoding.
geometry_motion_vae = UnifyAutoencoderKL.from_pretrained(
    model_path,
    subfolder="geometry_motion_vae",
    low_cpu_mem_usage=True,
    torch_dtype=torch.float32,
    cache_dir=cache_dir,
).requires_grad_(False).to("cuda", dtype=torch.float32)

# Build the pipeline on top of the Stable Video Diffusion backbone.
pipeline_cls = (
    MotionCrafterDiffPipeline if model_type == "diff" else MotionCrafterDetermPipeline
)
pipe = pipeline_cls.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    unet=unet,
    torch_dtype=torch.float16,
    variant="fp16",
    cache_dir=cache_dir,
).to("cuda")
```
|
|
|
|
|
## Model Weights |
|
|
|
|
|
- geometry_motion_vae/: 4D VAE for the joint geometry and motion representation
- unet_determ/: deterministic UNet for motion prediction
- unet_diff/: diffusion UNet for probabilistic motion prediction
|
|
|
|
|
## Model Variants |
|
|
|
|
|
- Deterministic (unet_determ): fast inference with a fixed prediction per input
- Diffusion (unet_diff): probabilistic sampling that can produce diverse outputs
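The practical difference shows up in seeding: the deterministic variant returns the same prediction for a given input, while the diffusion variant draws noise, so fixing a seed makes its samples reproducible and varying it yields diverse outputs. A minimal sketch of seed control with `torch.Generator` (how MotionCrafter's pipelines consume a generator is an assumption borrowed from diffusers-style APIs; check the main repository):

```python
import torch

# Two generators with the same seed produce identical noise draws, so a
# seeded diffusion run is repeatable; different seeds give the diverse
# samples the diffusion variant is designed for.
g1 = torch.Generator().manual_seed(42)
g2 = torch.Generator().manual_seed(42)
same = torch.equal(torch.randn(4, generator=g1), torch.randn(4, generator=g2))
print(same)  # True
```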
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@article{zhu2025motioncrafter, |
|
|
title={MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE}, |
|
|
author={Zhu, Ruijie and Lu, Jiahao and Hu, Wenbo and Han, Xiaoguang and Cai, Jianfei and Shan, Ying and Zheng, Chuanxia}, |
|
|
journal={arXiv preprint arXiv:2602.08961}, |
|
|
year={2026} |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
This model is provided under the Tencent License. See [LICENSE.txt](LICENSE.txt) for details. |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
This work builds upon [GeometryCrafter](https://github.com/TencentARC/GeometryCrafter). We thank the authors for their excellent contributions. |
|
|
|
|
|
|