MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE
Ruijie Zhu1,2, Jiahao Lu3, Wenbo Hu2, Xiaoguang Han4, Jianfei Cai5, Ying Shan2, Chuanxia Zheng1
1 NTU 2 ARC Lab, Tencent PCG 3 HKUST 4 CUHK(SZ) 5 Monash University
Overview
This repository contains the pretrained model weights for MotionCrafter, a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense object motion from monocular videos.
MotionCrafter simultaneously predicts:
- Dense point maps: 3D coordinates in world space for each pixel
- Scene flow: Per-pixel motion estimation across frames
All predictions are expressed in a single, unified world coordinate system and require no post-hoc optimization.
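To make the two outputs concrete, here is a minimal sketch of how dense point maps and scene flow relate. The shapes and tensor layout below are illustrative assumptions, not the model's actual output format:

```python
import torch

# Assumed shapes for illustration: T frames, H x W pixels.
T, H, W = 25, 320, 640

# Dense point map: a 3D world-space coordinate for every pixel of every frame.
point_maps = torch.randn(T, H, W, 3)

# Scene flow: the 3D motion of each pixel's point from frame t to frame t+1.
scene_flow = torch.randn(T - 1, H, W, 3)

# Advecting frame t's points by their scene flow estimates where those points
# lie at frame t+1, in the same world coordinate system.
predicted_next = point_maps[:-1] + scene_flow
print(predicted_next.shape)  # torch.Size([24, 320, 640, 3])
```

Because both quantities live in one world frame, adding the flow to the point map is a plain element-wise sum with no per-frame alignment step.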
Model Weights
This repository includes the following pretrained models:
1. Geometry Motion VAE (`geometry_motion_vae/`)
- Purpose: Encodes 4D geometry and motion information into a latent space
- Architecture: 4D VAE for joint geometry and motion representation
- Input: Videos with associated geometry and motion annotations
- Output: Compressed 4D latent codes
2. Deterministic UNet (`unet_determ/`)
- Purpose: Predicts dense geometry and motion from video frames
- Architecture: Deterministic UNet conditioned on video input
- Input: Video frames
- Output: Dense point maps and scene flow predictions
Usage
Basic Usage
Load the pretrained models using the MotionCrafter library:
```python
import torch
from motioncrafter import (
    MotionCrafterDiffPipeline,
    MotionCrafterDetermPipeline,
    UnifyAutoencoderKL,
    UNetSpatioTemporalConditionModelVid2vid,
)

# Paths to model weights (or use the Hugging Face repo ID)
unet_path = "TencentARC/MotionCrafter"
vae_path = "TencentARC/MotionCrafter"
model_type = "determ"  # or "diff" for the diffusion version
cache_dir = "./pretrained_models"

# Load the UNet that predicts geometry and motion
unet = UNetSpatioTemporalConditionModelVid2vid.from_pretrained(
    unet_path,
    subfolder='unet_diff' if model_type == 'diff' else 'unet_determ',
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    cache_dir=cache_dir,
).requires_grad_(False).to("cuda", dtype=torch.float16)

# Load the geometry-and-motion VAE for point map decoding
geometry_motion_vae = UnifyAutoencoderKL.from_pretrained(
    vae_path,
    subfolder='geometry_motion_vae',
    low_cpu_mem_usage=True,
    torch_dtype=torch.float32,
    cache_dir=cache_dir,
).requires_grad_(False).to("cuda", dtype=torch.float32)

# Initialize the pipeline matching the chosen model type
if model_type == 'diff':
    pipe = MotionCrafterDiffPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        unet=unet,
        torch_dtype=torch.float16,
        variant="fp16",
        cache_dir=cache_dir,
    ).to("cuda")
else:
    pipe = MotionCrafterDetermPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        unet=unet,
        torch_dtype=torch.float16,
        variant="fp16",
        cache_dir=cache_dir,
    ).to("cuda")

# Your inference code here...
```
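The exact pipeline call signature is documented in the main repository. As a starting point, the sketch below shows one hypothetical way to turn raw video frames into a tensor for the pipeline; the `[-1, 1]` normalization is an assumption based on common Stable Video Diffusion conventions, not a confirmed detail of MotionCrafter:

```python
import torch

def preprocess_frames(frames_uint8: torch.Tensor) -> torch.Tensor:
    """Convert (T, H, W, 3) uint8 frames to a (T, 3, H, W) float tensor in [-1, 1].

    Illustrative helper only; check the main repository for the pipeline's
    actual expected input format.
    """
    frames = frames_uint8.float() / 255.0   # uint8 -> [0, 1]
    frames = frames.permute(0, 3, 1, 2)     # channels-last -> channels-first
    return frames * 2.0 - 1.0               # [0, 1] -> [-1, 1]

# Dummy 25-frame clip standing in for decoded video frames.
video = torch.randint(0, 256, (25, 320, 640, 3), dtype=torch.uint8)
batch = preprocess_frames(video)
print(batch.shape)  # torch.Size([25, 3, 320, 640])
# `batch` would then be passed to `pipe(...)`.
```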
Model Variants
- Deterministic (`unet_determ`): fast inference with a fixed prediction per input
- Diffusion (`unet_diff`): probabilistic sampling that can produce diverse outputs
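Diffusers-style pipelines usually accept a seeded `torch.Generator` to make sampling reproducible; assuming `MotionCrafterDiffPipeline` follows that convention, different seeds give diverse but individually repeatable predictions. The snippet below only demonstrates the generator mechanism itself:

```python
import torch

# Two generators with the same seed produce identical random streams;
# a different seed produces a different stream. In a diffusers-style
# pipeline this is what makes one sampled prediction repeatable while
# still allowing diverse outputs across seeds.
a1 = torch.rand(3, generator=torch.Generator().manual_seed(42))
a2 = torch.rand(3, generator=torch.Generator().manual_seed(42))
b = torch.rand(3, generator=torch.Generator().manual_seed(7))
print(torch.equal(a1, a2))  # True
# A hypothetical call would look like: pipe(frames, generator=...)
```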
For complete inference examples and additional documentation, please refer to the main repository.
Model Details
- Framework: PyTorch
- Model Format: `safetensors` (for safe model loading)
- Resolution: supports variable resolutions (e.g., 320×640, 512×1024)
- Frame Count: Tested with 25 frames
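Since the model is tested with 25 frames at resolutions such as 320×640, an input video generally needs temporal sampling and spatial resizing first. A minimal sketch of such a helper (illustrative only; the repository's own preprocessing may differ):

```python
import torch
import torch.nn.functional as F

def prepare_clip(video: torch.Tensor, num_frames: int = 25,
                 size: tuple = (320, 640)) -> torch.Tensor:
    """Uniformly sample `num_frames` frames and resize to `size` = (H, W).

    `video` is a (T, 3, H, W) float tensor. Hypothetical helper for
    illustration, not the repository's own preprocessing.
    """
    # Uniform temporal sampling down to the tested frame count.
    idx = torch.linspace(0, video.shape[0] - 1, num_frames).round().long()
    clip = video[idx]
    # Bilinear spatial resize to a supported resolution.
    return F.interpolate(clip, size=size, mode="bilinear", align_corners=False)

clip = prepare_clip(torch.rand(60, 3, 720, 1280))
print(clip.shape)  # torch.Size([25, 3, 320, 640])
```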
Citation
If you find MotionCrafter useful for your research, please cite:
License
This model is provided under the Tencent License. Please see LICENSE.txt for details.
Acknowledgments
This work builds upon GeometryCrafter. We thank the authors for their excellent contributions.