
MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE

Ruijie Zhu¹ ², Jiahao Lu³, Wenbo Hu², Xiaoguang Han⁴, Jianfei Cai⁵, Ying Shan², Chuanxia Zheng¹

¹ NTU   ² ARC Lab, Tencent PCG   ³ HKUST   ⁴ CUHK(SZ)   ⁵ Monash University

📄 Paper | 🌐 Project Page | 💻 Code | 📜 License


Overview

This repository contains the pretrained model weights for MotionCrafter, a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense object motion from monocular videos.

MotionCrafter simultaneously predicts:

  • Dense point maps: 3D coordinates in world space for each pixel
  • Scene flow: Per-pixel motion estimation across frames

All predictions are made in a single, unified world coordinate system, with no post-hoc optimization required.
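As a rough illustration of how these two outputs relate, the sketch below assumes per-frame point maps and forward scene flow both stored as (H, W, 3) arrays; the actual tensor layout used by MotionCrafter may differ. Adding the flow to frame t's points approximates the 3D position of those points at frame t+1:

```python
import numpy as np

# Hypothetical shapes: per-pixel 3D world coordinates and per-pixel 3D motion.
H, W = 4, 6
points_t = np.random.rand(H, W, 3)            # dense point map at frame t
scene_flow_t = np.random.rand(H, W, 3) * 0.1  # forward scene flow, t -> t+1

# Scene flow carries each 3D point from frame t to its position at frame t+1
# (ignoring occlusions and points that leave the view).
points_t1_pred = points_t + scene_flow_t

print(points_t1_pred.shape)  # (4, 6, 3)
```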

Model Weights

This repository includes the following pretrained models:

1. Geometry Motion VAE (geometry_motion_vae/)

  • Purpose: Encodes 4D geometry and motion information into a latent space
  • Architecture: 4D VAE for joint geometry and motion representation
  • Input: Videos with associated geometry and motion annotations
  • Output: Compressed 4D latent codes

2. UNet Deterministic (unet_determ/)

  • Purpose: Predicts dense geometry and motion from video frames
  • Architecture: Deterministic UNet conditioned on video input
  • Input: Video frames
  • Output: Dense point maps and scene flow predictions

Usage

Basic Usage

Load the pretrained models using the MotionCrafter library:

import torch
from motioncrafter import (
    MotionCrafterDiffPipeline,
    MotionCrafterDetermPipeline,
    UnifyAutoencoderKL,
    UNetSpatioTemporalConditionModelVid2vid
)

# Paths to model weights (or use HuggingFace repo ID)
unet_path = "TencentARC/MotionCrafter"
vae_path = "TencentARC/MotionCrafter"
model_type = "determ"  # or "diff" for diffusion version
cache_dir = "./pretrained_models"

# Load the UNet that predicts dense geometry and motion from video frames
unet = UNetSpatioTemporalConditionModelVid2vid.from_pretrained(
    unet_path,
    subfolder='unet_diff' if model_type == 'diff' else 'unet_determ',
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    cache_dir=cache_dir
).requires_grad_(False).to("cuda", dtype=torch.float16)

# Load geometry and motion VAE for point map decoding
geometry_motion_vae = UnifyAutoencoderKL.from_pretrained(
    vae_path,
    subfolder='geometry_motion_vae',
    low_cpu_mem_usage=True,
    torch_dtype=torch.float32,
    cache_dir=cache_dir
).requires_grad_(False).to("cuda", dtype=torch.float32)

# Initialize pipeline based on model type
if model_type == 'diff':
    pipe = MotionCrafterDiffPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        unet=unet,
        torch_dtype=torch.float16,
        variant="fp16",
        cache_dir=cache_dir
    ).to("cuda")
else:
    pipe = MotionCrafterDetermPipeline.from_pretrained(
        "stabilityai/stable-video-diffusion-img2vid-xt",
        unet=unet,
        torch_dtype=torch.float16,
        variant="fp16",
        cache_dir=cache_dir
    ).to("cuda")

# Your inference code here...

Model Variants

  • Deterministic (unet_determ): Fast inference with fixed predictions per input
  • Diffusion (unet_diff): Probabilistic predictions with diverse outputs
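The variant flag maps directly to a weight subfolder in this repo, as the loading snippet above shows. A minimal helper (hypothetical, not part of the MotionCrafter library) makes the mapping explicit and fails loudly on an unknown flag:

```python
def unet_subfolder(model_type: str) -> str:
    """Map a model-variant flag to its weight subfolder in the repo."""
    subfolders = {"determ": "unet_determ", "diff": "unet_diff"}
    try:
        return subfolders[model_type]
    except KeyError:
        raise ValueError(
            f"unknown model_type {model_type!r}; "
            f"expected one of {sorted(subfolders)}"
        )

print(unet_subfolder("determ"))  # -> unet_determ
```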

For complete inference examples and additional documentation, please refer to the main repository.

Model Details

  • Framework: PyTorch
  • Model Format: safetensors (for safe model loading)
  • Resolution: Supports variable resolutions (e.g., 320×640, 512×1024)
  • Frame Count: Tested with 25 frames
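Both tested resolutions are multiples of 64 with a 1:2 aspect ratio, so when feeding frames of arbitrary size it can help to snap them to a nearby compatible size first. The helper below is a hypothetical convenience (not part of the MotionCrafter API), and the multiple-of-64 constraint is an assumption inferred from the example resolutions:

```python
def snap_resolution(height: int, width: int, multiple: int = 64) -> tuple[int, int]:
    """Round each dimension down to the nearest multiple, with a floor of one multiple."""
    return (
        max(multiple, height // multiple * multiple),
        max(multiple, width // multiple * multiple),
    )

print(snap_resolution(333, 650))  # -> (320, 640), one of the tested resolutions
```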

Citation

If you find MotionCrafter useful for your research, please cite:


License

This model is provided under the Tencent License. Please see LICENSE.txt for details.

Acknowledgments

This work builds upon GeometryCrafter. We thank the authors for their excellent contributions.