+# MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE
+<div align="center">
+[Ruijie Zhu](https://ruijiezhu94.github.io/ruijiezhu/)<sup>1,2</sup>,
+[Jiahao Lu](https://scholar.google.com/citations?user=cRpteW4AAAAJ&hl=en)<sup>3</sup>,
+[Wenbo Hu](https://wbhu.github.io/)<sup>2</sup>,
+[Xiaoguang Han](https://scholar.google.com/citations?user=z-rqsR4AAAAJ&hl=en)<sup>4</sup>,
+[Jianfei Cai](https://jianfei-cai.github.io/)<sup>5</sup>,
+[Ying Shan](https://scholar.google.com/citations?user=4oXBp9UAAAAJ&hl=en)<sup>2</sup>,
+[Chuanxia Zheng](https://physicalvision.github.io/people/~chuanxia)<sup>1</sup>
+<sup>1</sup> NTU &nbsp; <sup>2</sup> ARC Lab, Tencent PCG &nbsp; <sup>3</sup> HKUST &nbsp; <sup>4</sup> CUHK(SZ) &nbsp; <sup>5</sup> Monash University
+[📄 Paper](https://arxiv.org/abs/xxxxx) | [🌐 Project Page](https://ruijiezhu94.github.io/MotionCrafter_Page/) | [💻 Code](https://github.com/TencentARC/MotionCrafter) | [📜 License](LICENSE.txt)
+</div>
+---
+## Overview
+This repository contains the pretrained model weights for **MotionCrafter**, a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense object motion from monocular videos.
+MotionCrafter simultaneously predicts:
+- **Dense point maps**: 3D coordinates in world space for each pixel
+- **Scene flow**: Per-pixel 3D motion between frames
+All predictions are made within a unified world coordinate system, without requiring post-optimization.
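To make the relationship between the two outputs concrete, here is a minimal NumPy sketch. It assumes that both point maps and scene flow are stored as `(T, H, W, 3)` arrays in the same world frame, and that the scene flow at frame `t` is the displacement of each pixel's 3D point towards frame `t + 1`. The array layout and the helper name `advect_points` are illustrative assumptions, not MotionCrafter's documented output format.

```python
import numpy as np


def advect_points(point_map: np.ndarray, scene_flow: np.ndarray) -> np.ndarray:
    """Move each frame's 3D points by its scene flow (hypothetical helper).

    point_map:  (T, H, W, 3) world-space XYZ per pixel (assumed layout).
    scene_flow: (T, H, W, 3) world-space displacement to the next frame (assumed).
    Returns the (T, H, W, 3) positions the points would occupy one step later.
    """
    if point_map.shape != scene_flow.shape or point_map.shape[-1] != 3:
        raise ValueError("expected two matching (T, H, W, 3) arrays")
    return point_map + scene_flow


# Toy data: 2 frames of 2x2 pixels, every point moves 0.5 units along +X.
points = np.zeros((2, 2, 2, 3), dtype=np.float32)
flow = np.zeros_like(points)
flow[..., 0] = 0.5

moved = advect_points(points, flow)
speed = np.linalg.norm(flow, axis=-1)  # per-pixel motion magnitude, shape (2, 2, 2)
```

Because both quantities live in one world coordinate system, the displacement can be added directly, with no per-frame camera transform in between.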
+## Model Weights
+This repository includes the following pretrained models:
+### 1. Geometry Motion VAE (`geometry_motion_vae/`)
+- **Purpose**: Encodes 4D geometry and motion information into a latent space
+- **Architecture**: 4D VAE for joint geometry and motion representation
+- **Input**: Videos with associated geometry and motion annotations
+- **Output**: Compressed 4D latent codes
+### 2. UNet Deterministic (`unet_determ/`)
+- **Purpose**: Predicts dense geometry and motion from video frames
+- **Architecture**: Deterministic UNet conditioned on video input
+- **Input**: Video frames
+- **Output**: Dense point maps and scene flow predictions
+## Usage
+### Basic Usage
+Load the pretrained models using the MotionCrafter library:
+```python
+import torch
+from motioncrafter import (
+    MotionCrafterDiffPipeline,
+    MotionCrafterDetermPipeline,
+    UnifyAutoencoderKL,
+    UNetSpatioTemporalConditionModelVid2vid
+)
+# Local paths or Hugging Face repo IDs for the model weights
+unet_path = "TencentARC/MotionCrafter"
+vae_path = "TencentARC/MotionCrafter"
+model_type = "determ"  # or "diff" for diffusion version
+cache_dir = "./pretrained_models"
+# Load the UNet that predicts dense geometry and motion
+unet = UNetSpatioTemporalConditionModelVid2vid.from_pretrained(
+    unet_path,
+    subfolder='unet_diff' if model_type == 'diff' else 'unet_determ',
+    low_cpu_mem_usage=True,
+    torch_dtype=torch.float16,
+    cache_dir=cache_dir
+).requires_grad_(False).to("cuda", dtype=torch.float16)
+# Load geometry and motion VAE for point map decoding
+geometry_motion_vae = UnifyAutoencoderKL.from_pretrained(
+    vae_path,
+    subfolder='geometry_motion_vae',
+    low_cpu_mem_usage=True,
+    torch_dtype=torch.float32,
+    cache_dir=cache_dir
+).requires_grad_(False).to("cuda", dtype=torch.float32)
+# Pick the pipeline class that matches the model type, then build it once
+pipeline_cls = MotionCrafterDiffPipeline if model_type == 'diff' else MotionCrafterDetermPipeline
+pipe = pipeline_cls.from_pretrained(
+    "stabilityai/stable-video-diffusion-img2vid-xt",
+    unet=unet,
+    torch_dtype=torch.float16,
+    variant="fp16",
+    cache_dir=cache_dir
+).to("cuda")
+# Your inference code here...
+```
+### Model Variants
+- **Deterministic (`unet_determ`)**: Fast inference; the same input always yields the same prediction
+- **Diffusion (`unet_diff`)**: Sampled predictions that can vary across runs for the same input
+For complete inference examples and additional documentation, please refer to the [main repository](https://github.com/TencentARC/MotionCrafter).
+## Model Details
+- **Framework**: PyTorch
+- **Model Format**: `safetensors` (for safe model loading)
+- **Resolution**: Supports variable resolutions (e.g., 320×640, 512×1024)
+- **Frame Count**: Tested with 25 frames
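Since the models are reported as tested with 25 frames, a longer video has to be split into clips of that length before inference. The sketch below shows one possible way to compute overlapping 25-frame windows. The helper `frame_windows` and the overlap of 5 frames are hypothetical choices for illustration, not part of the MotionCrafter library, and how predictions from different windows are merged is left open.

```python
from typing import List, Tuple


def frame_windows(num_frames: int, window: int = 25, overlap: int = 5) -> List[Tuple[int, int]]:
    """Return [start, end) frame index ranges covering a video (hypothetical helper).

    Every window has exactly `window` frames, except when the video itself is
    shorter. The last window is shifted back so that it ends on the final frame.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if num_frames <= window:
        return [(0, num_frames)]

    stride = window - overlap
    starts = list(range(0, num_frames - window + 1, stride))
    # Add a final window aligned to the end if the regular stride left a tail.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(start, start + window) for start in starts]


windows = frame_windows(60)
```

For a 60-frame video this yields `[(0, 25), (20, 45), (35, 60)]`; the last window overlaps its predecessor by more than 5 frames so that no clip is shorter than 25 frames.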
+## Citation
+If you find MotionCrafter useful for your research, please cite:
+```bibtex
+```
+## License
+This model is provided under the Tencent License. Please see [LICENSE.txt](LICENSE.txt) for details.
+## Acknowledgments
+This work builds upon [GeometryCrafter](https://github.com/TencentARC/GeometryCrafter). We thank the authors for their excellent contributions.