---
language: [en]
license: other
library_name: motioncrafter
tags:
- motion
- video
- 4d
- diffusion
- scene-flow
pipeline_tag: image-to-3d
base_model: stabilityai/stable-video-diffusion-img2vid-xt
---

<h1 align="center" style="font-size: 1.6em;">MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE</h1>

<div align="center">

[Ruijie Zhu](https://ruijiezhu94.github.io/ruijiezhu/)<sup>1,2</sup>,
[Jiahao Lu](https://scholar.google.com/citations?user=cRpteW4AAAAJ&hl=en)<sup>3</sup>,
[Wenbo Hu](https://wbhu.github.io/)<sup>2</sup>,
[Xiaoguang Han](https://scholar.google.com/citations?user=z-rqsR4AAAAJ&hl=en)<sup>4</sup><br>
[Jianfei Cai](https://jianfei-cai.github.io/)<sup>5</sup>,
[Ying Shan](https://scholar.google.com/citations?user=4oXBp9UAAAAJ&hl=en)<sup>2</sup>,
[Chuanxia Zheng](https://physicalvision.github.io/people/~chuanxia)<sup>1</sup>

<sup>1</sup> NTU &nbsp; <sup>2</sup> ARC Lab, Tencent PCG &nbsp; <sup>3</sup> HKUST &nbsp; <sup>4</sup> CUHK(SZ) &nbsp; <sup>5</sup> Monash University

[📄 Paper](https://arxiv.org/abs/2602.08961) | [🌍 Project Page](https://ruijiezhu94.github.io/MotionCrafter_Page/) | [💻 Code](https://github.com/TencentARC/MotionCrafter) | [📜 License](LICENSE.txt)

</div>

## Model Description

MotionCrafter is a video-diffusion-based framework that jointly reconstructs 4D geometry and estimates dense object motion from monocular videos. It predicts dense point maps and scene flow for every frame in a shared world coordinate system, with no post-hoc optimization required.
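The two dense outputs can be pictured with a toy NumPy sketch. This is illustrative only: the shapes and the simple additive warp are assumptions for exposition, not the model's actual tensor layout.

```python
import numpy as np

# Toy illustration (not the MotionCrafter API): a per-frame "point map" assigns
# each pixel a 3D point in a shared world coordinate system, and "scene flow"
# gives each point's 3D displacement to the next frame. Because both live in
# the same world frame, advancing the geometry is simple addition.
H, W = 4, 6                               # tiny toy resolution
point_map = np.zeros((H, W, 3))           # frame-t 3D points (world coords)
scene_flow = np.full((H, W, 3), 0.1)      # per-point 3D motion to frame t+1

points_next = point_map + scene_flow      # frame t+1 geometry, same world frame
print(points_next.shape)                  # (4, 6, 3)
```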

## Intended Use

- Research on 4D reconstruction and motion estimation from monocular videos
- Academic evaluation and benchmarking of dense point map and scene flow prediction

Not intended for safety-critical or real-time production use.

## Limitations

- Performance can degrade with extreme motion blur or severe occlusion.
- Output quality is sensitive to input resolution and video quality.
- Generalization may be limited for out-of-domain scenes.

## Training Data

Training data details and preprocessing steps are described in the paper and the main repository; for dataset specifics, see the project page and the paper.

## Evaluation

Please refer to the paper for evaluation datasets, metrics, and results.

## How to Use

```python
import torch
from motioncrafter import (
    MotionCrafterDiffPipeline,
    MotionCrafterDetermPipeline,
    UnifyAutoencoderKL,
    UNetSpatioTemporalConditionModelVid2vid
)

unet_path = "TencentARC/MotionCrafter"
vae_path = "TencentARC/MotionCrafter"
model_type = "determ"  # or "diff" for diffusion version
cache_dir = "./pretrained_models"

# MotionCrafter UNet (fp16, inference only); the subfolder selects the variant.
unet = UNetSpatioTemporalConditionModelVid2vid.from_pretrained(
    unet_path,
    subfolder='unet_diff' if model_type == 'diff' else 'unet_determ',
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    cache_dir=cache_dir
).requires_grad_(False).to("cuda", dtype=torch.float16)

# 4D VAE for the joint geometry/motion representation (kept in fp32).
geometry_motion_vae = UnifyAutoencoderKL.from_pretrained(
    vae_path,
    subfolder='geometry_motion_vae',
    low_cpu_mem_usage=True,
    torch_dtype=torch.float32,
    cache_dir=cache_dir
).requires_grad_(False).to("cuda", dtype=torch.float32)

# Both pipelines share the SVD base; pick the class matching the UNet variant.
pipeline_cls = (MotionCrafterDiffPipeline if model_type == 'diff'
                else MotionCrafterDetermPipeline)
pipe = pipeline_cls.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    unet=unet,
    torch_dtype=torch.float16,
    variant="fp16",
    cache_dir=cache_dir
).to("cuda")
```

## Model Weights

- `geometry_motion_vae/`: 4D VAE for joint geometry and motion representation
- `unet_determ/`: deterministic UNet for geometry and motion prediction
- `unet_diff/`: diffusion UNet for probabilistic geometry and motion prediction

## Model Variants

- Deterministic (`unet_determ`): fast inference with fixed predictions per input
- Diffusion (`unet_diff`): probabilistic predictions with diverse outputs

## Citation

```bibtex
@article{zhu2025motioncrafter,
  title={MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE},
  author={Zhu, Ruijie and Lu, Jiahao and Hu, Wenbo and Han, Xiaoguang and Cai, Jianfei and Shan, Ying and Zheng, Chuanxia},
  journal={arXiv preprint arXiv:2602.08961},
  year={2026}
}
```

## License

This model is provided under the Tencent License. See [LICENSE.txt](LICENSE.txt) for details.

## Acknowledgments

This work builds upon [GeometryCrafter](https://github.com/TencentARC/GeometryCrafter). We thank the authors for their excellent contributions.