MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE
Ruijie Zhu1,2, Jiahao Lu3, Wenbo Hu2, Xiaoguang Han4, Jianfei Cai5, Ying Shan2, Chuanxia Zheng1
1 NTU 2 ARC Lab, Tencent PCG 3 HKUST 4 CUHK(SZ) 5 Monash University
Overview
This repository contains the pretrained model weights for MotionCrafter, a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense object motion from monocular videos.
MotionCrafter simultaneously predicts:
- Dense point maps: 3D coordinates in world space for each pixel
- Scene flow: Per-pixel motion estimation across frames
All predictions are expressed in a single, unified world coordinate system and require no post-hoc optimization.
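To make the two outputs concrete, here is a minimal sketch of how dense point maps and scene flow relate. The shapes and tensor layout below are illustrative assumptions, not the model's actual output format:

```python
import torch

# Assumed shapes for illustration: T frames, H x W pixels.
T, H, W = 25, 320, 640

# Dense point map: a 3D world-space coordinate for every pixel of every frame.
point_maps = torch.randn(T, H, W, 3)

# Scene flow: the 3D motion of each pixel's point from frame t to frame t+1.
scene_flow = torch.randn(T - 1, H, W, 3)

# Advecting frame t's points by their scene flow estimates where those points
# lie at frame t+1, in the same world coordinate system.
predicted_next = point_maps[:-1] + scene_flow
print(predicted_next.shape)  # torch.Size([24, 320, 640, 3])
```

Because both quantities live in one world frame, adding the flow to the point map is a plain element-wise sum with no per-frame alignment step.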
Model Weights
This repository includes the following pretrained models:
1. Geometry Motion VAE (`geometry_motion_vae/`)
- Purpose: Encodes 4D geometry and motion information into a latent space
- Architecture: 4D VAE for joint geometry and motion representation
- Input: Videos with associated geometry and motion annotations
- Output: Compressed 4D latent codes
2. Deterministic UNet (`unet_determ/`)
- Purpose: Predicts dense geometry and motion from video frames
- Architecture: Deterministic UNet conditioned on video input
- Input: Video frames
- Output: Dense point maps and scene flow predictions
Usage
Basic Usage
Load the pretrained models using the MotionCrafter library:
```python
import torch
from motioncrafter import (
    MotionCrafterDiffPipeline,
    MotionCrafterDetermPipeline,
    UnifyAutoencoderKL,
    UNetSpatioTemporalConditionModelVid2vid,
)

# Paths to model weights (or use the Hugging Face repo ID)
unet_path = "TencentARC/MotionCrafter"
vae_path = "TencentARC/MotionCrafter"
model_type = "determ"  # or "diff" for the diffusion version
cache_dir = "./pretrained_models"

# Load the UNet that predicts geometry and motion
unet = UNetSpatioTemporalConditionModelVid2vid.from_pretrained(
    unet_path,
    subfolder='unet_diff' if model_type == 'diff' else 'unet_determ',
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    cache_dir=cache_dir,
).requires_grad_(False).to("cuda", dtype=torch.float16)

# Load the geometry-and-motion VAE for point map decoding
geometry_motion_vae = UnifyAutoencoderKL.from_pretrained(
    vae_path,
    subfolder='geometry_motion_vae',
    low_cpu_mem_usage=True,
    torch_dtype=torch.float32,
    cache_dir=cache_dir,
).requires_grad_(False).to("cuda", dtype=torch.float32)

# Initialize the pipeline matching the chosen model type
if model_type == 'diff':
    pipe = MotionCrafterDiffPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        unet=unet,
        torch_dtype=torch.float16,
        variant="fp16",
        cache_dir=cache_dir,
    ).to("cuda")
else:
    pipe = MotionCrafterDetermPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        unet=unet,
        torch_dtype=torch.float16,
        variant="fp16",
        cache_dir=cache_dir,
    ).to("cuda")

# Your inference code here...
```
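The exact pipeline call signature is documented in the main repository. As a starting point, the sketch below shows one hypothetical way to turn raw video frames into a tensor for the pipeline; the `[-1, 1]` normalization is an assumption based on common Stable Video Diffusion conventions, not a confirmed detail of MotionCrafter:

```python
import torch

def preprocess_frames(frames_uint8: torch.Tensor) -> torch.Tensor:
    """Convert (T, H, W, 3) uint8 frames to a (T, 3, H, W) float tensor in [-1, 1].

    Illustrative helper only; check the main repository for the pipeline's
    actual expected input format.
    """
    frames = frames_uint8.float() / 255.0   # uint8 -> [0, 1]
    frames = frames.permute(0, 3, 1, 2)     # channels-last -> channels-first
    return frames * 2.0 - 1.0               # [0, 1] -> [-1, 1]

# Dummy 25-frame clip standing in for decoded video frames.
video = torch.randint(0, 256, (25, 320, 640, 3), dtype=torch.uint8)
batch = preprocess_frames(video)
print(batch.shape)  # torch.Size([25, 3, 320, 640])
# `batch` would then be passed to `pipe(...)`.
```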
Model Variants
- Deterministic (`unet_determ`): fast inference with a fixed prediction per input
- Diffusion (`unet_diff`): probabilistic sampling that can produce diverse outputs
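Diffusers-style pipelines usually accept a seeded `torch.Generator` to make sampling reproducible; assuming `MotionCrafterDiffPipeline` follows that convention, different seeds give diverse but individually repeatable predictions. The snippet below only demonstrates the generator mechanism itself:

```python
import torch

# Two generators with the same seed produce identical random streams;
# a different seed produces a different stream. In a diffusers-style
# pipeline this is what makes one sampled prediction repeatable while
# still allowing diverse outputs across seeds.
a1 = torch.rand(3, generator=torch.Generator().manual_seed(42))
a2 = torch.rand(3, generator=torch.Generator().manual_seed(42))
b = torch.rand(3, generator=torch.Generator().manual_seed(7))
print(torch.equal(a1, a2))  # True
# A hypothetical call would look like: pipe(frames, generator=...)
```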
For complete inference examples and additional documentation, please refer to the main repository.
Model Details
- Framework: PyTorch
- Model Format: `safetensors` (for safe model loading)
- Resolution: supports variable resolutions (e.g., 320×640, 512×1024)
- Frame Count: Tested with 25 frames
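Since the model is tested with 25 frames at resolutions such as 320×640, an input video generally needs temporal sampling and spatial resizing first. A minimal sketch of such a helper (illustrative only; the repository's own preprocessing may differ):

```python
import torch
import torch.nn.functional as F

def prepare_clip(video: torch.Tensor, num_frames: int = 25,
                 size: tuple = (320, 640)) -> torch.Tensor:
    """Uniformly sample `num_frames` frames and resize to `size` = (H, W).

    `video` is a (T, 3, H, W) float tensor. Hypothetical helper for
    illustration, not the repository's own preprocessing.
    """
    # Uniform temporal sampling down to the tested frame count.
    idx = torch.linspace(0, video.shape[0] - 1, num_frames).round().long()
    clip = video[idx]
    # Bilinear spatial resize to a supported resolution.
    return F.interpolate(clip, size=size, mode="bilinear", align_corners=False)

clip = prepare_clip(torch.rand(60, 3, 720, 1280))
print(clip.shape)  # torch.Size([25, 3, 320, 640])
```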
Citation
If you find MotionCrafter useful for your research, please cite:
License
This model is provided under the Tencent License. Please see LICENSE.txt for details.
Acknowledgments
This work builds upon GeometryCrafter. We thank the authors for their excellent contributions.