TencentARC/MotionCrafter

@@ -1,3 +1,16 @@
 # MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE
@@ -17,40 +30,32 @@
 </div>
----
-## Overview
-This repository contains the pretrained model weights for **MotionCrafter**, a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense object motion from monocular videos.
-MotionCrafter simultaneously predicts:
-- **Dense point maps**: 3D coordinates in world space for each pixel
-- **Scene flow**: Per-pixel motion estimation across frames
-All predictions are made within a unified world coordinate system, without requiring post-optimization.
-## Model Weights
-This repository includes the following pretrained models:
-### 1. Geometry Motion VAE (`geometry_motion_vae/`)
-- **Purpose**: Encodes 4D geometry and motion information into a latent space
-- **Architecture**: 4D VAE for joint geometry and motion representation
-- **Input**: Videos with associated geometry and motion annotations
-- **Output**: Compressed 4D latent codes
-### 2. UNet Deterministic (`unet_determ/`)
-- **Purpose**: Predicts dense geometry and motion from video frames
-- **Architecture**: Deterministic UNet conditioned on video input
-- **Input**: Video frames
-- **Output**: Dense point maps and scene flow predictions
-## Usage
-### Basic Usage
-Load the pretrained models using the MotionCrafter library:
 ```python
 import torch
@@ -61,13 +66,11 @@ from motioncrafter import (
     UNetSpatioTemporalConditionModelVid2vid
 )
-# Paths to model weights (or use HuggingFace repo ID)
 unet_path = "TencentARC/MotionCrafter"
 vae_path = "TencentARC/MotionCrafter"
 model_type = "determ"  # or "diff" for diffusion version
 cache_dir = "./pretrained_models"
-# Load UNet model for motion generation
 unet = UNetSpatioTemporalConditionModelVid2vid.from_pretrained(
     unet_path,
     subfolder='unet_diff' if model_type == 'diff' else 'unet_determ',
@@ -76,7 +79,6 @@ unet = UNetSpatioTemporalConditionModelVid2vid.from_pretrained(
     cache_dir=cache_dir
 ).requires_grad_(False).to("cuda", dtype=torch.float16)
-# Load geometry and motion VAE for point map decoding
 geometry_motion_vae = UnifyAutoencoderKL.from_pretrained(
     vae_path,
     subfolder='geometry_motion_vae',
@@ -85,7 +87,6 @@ geometry_motion_vae = UnifyAutoencoderKL.from_pretrained(
     cache_dir=cache_dir
 ).requires_grad_(False).to("cuda", dtype=torch.float32)
-# Initialize pipeline based on model type
 if model_type == 'diff':
     pipe = MotionCrafterDiffPipeline.from_pretrained(
         "stabilityai/stable-video-diffusion-img2vid-xt",
@@ -102,38 +103,38 @@ else:
         variant="fp16",
         cache_dir=cache_dir
     ).to("cuda")
-# Your inference code here...
 ```
-### Model Variants
-- **Deterministic (`unet_determ`)**: Fast inference with fixed predictions per input
-- **Diffusion (`unet_diff`)**: Probabilistic predictions with diverse outputs
-For complete inference examples and additional documentation, please refer to the [main repository](https://github.com/TencentARC/MotionCrafter).
-## Model Details
-- **Framework**: PyTorch
-- **Model Format**: `safetensors` (for safe model loading)
-- **Resolution**: Supports variable resolutions (e.g., 320×640, 512×1024)
-- **Frame Count**: Tested with 25 frames
 ## Citation
-If you find MotionCrafter useful for your research, please cite:
 ```bibtex
 ```
 ## License
-This model is provided under the Tencent License. Please see [LICENSE.txt](LICENSE.txt) for details.
 ## Acknowledgments
 This work builds upon [GeometryCrafter](https://github.com/TencentARC/GeometryCrafter). We thank the authors for their excellent contributions.

+---
+language: [en]
+license: other
+library_name: motioncrafter
+tags:
+- motion
+- video
+- 4d
+- diffusion
+- scene-flow
+pipeline_tag: image-to-3d
+base_model: stabilityai/stable-video-diffusion-img2vid-xt
+---
 # MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE
 </div>
+## Model Description
+MotionCrafter is a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense object motion from monocular videos. It predicts dense point maps and scene flow for each frame within a shared world coordinate system, without requiring post-optimization.
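As a rough illustration of how the two outputs relate, the sketch below assumes that scene flow stores, for each pixel, the 3D displacement of its point to the next frame, expressed in the same world coordinates as the point map. That convention and the helper name `advect_points` are assumptions made here for illustration, not part of the MotionCrafter API; check the main repository for the actual tensor layout.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def advect_points(points: List[Point3D], flow: List[Point3D]) -> List[Point3D]:
    """Move each world-space point by its scene-flow vector.

    Hypothetical helper: assumes flow[i] is the displacement of points[i]
    to the next frame, in the same world coordinate system.
    """
    if len(points) != len(flow):
        raise ValueError("points and flow must have one entry per pixel")
    return [
        (px + fx, py + fy, pz + fz)
        for (px, py, pz), (fx, fy, fz) in zip(points, flow)
    ]


# Two pixels of one frame: the first is static, the second moves 0.5 along x.
frame_points = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
frame_flow = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
next_positions = advect_points(frame_points, frame_flow)
```

Because both quantities live in one world frame, no per-frame camera transform is needed in this step; under the stated assumption a static pixel simply has zero flow.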
+## Intended Use
+- Research on 4D reconstruction and motion estimation from monocular videos
+- Academic evaluation and benchmarking of dense point map and scene flow prediction
+Not intended for safety-critical or real-time production use.
+## Limitations
+- Performance can degrade with extreme motion blur or severe occlusion.
+- Output quality is sensitive to input resolution and video quality.
+- Generalization may be limited for out-of-domain scenes.
+## Training Data
+Training data and preprocessing are described in the paper and the main repository; refer to the project page and the paper for dataset specifics.
+## Evaluation
+Please refer to the paper for evaluation datasets, metrics, and results.
+## How to Use
 ```python
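 import torch
 # The diff elides part of this import; only the names used below are listed.
 from motioncrafter import (
     MotionCrafterDiffPipeline,
     UnifyAutoencoderKL,
     UNetSpatioTemporalConditionModelVid2vid
 )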
 unet_path = "TencentARC/MotionCrafter"
 vae_path = "TencentARC/MotionCrafter"
 model_type = "determ"  # or "diff" for diffusion version
 cache_dir = "./pretrained_models"
 unet = UNetSpatioTemporalConditionModelVid2vid.from_pretrained(
     unet_path,
     subfolder='unet_diff' if model_type == 'diff' else 'unet_determ',
     cache_dir=cache_dir
 ).requires_grad_(False).to("cuda", dtype=torch.float16)
 geometry_motion_vae = UnifyAutoencoderKL.from_pretrained(
     vae_path,
     subfolder='geometry_motion_vae',
     cache_dir=cache_dir
 ).requires_grad_(False).to("cuda", dtype=torch.float32)
 if model_type == 'diff':
     pipe = MotionCrafterDiffPipeline.from_pretrained(
         "stabilityai/stable-video-diffusion-img2vid-xt",
         variant="fp16",
         cache_dir=cache_dir
     ).to("cuda")
 ```
+## Model Weights
+- `geometry_motion_vae/`: 4D VAE that encodes geometry and motion into a shared latent space
+- `unet_determ/`: deterministic UNet that predicts dense point maps and scene flow from video frames
+## Model Variants
+- Deterministic (unet_determ): fast inference with fixed predictions per input
+- Diffusion (unet_diff): probabilistic predictions with diverse outputs
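The choice between the two variants comes down to which UNet subfolder is loaded. Below is a minimal sketch of that selection, mirroring the conditional in the loading code above. The function name `unet_subfolder` and the explicit error for unknown values are assumptions added here, not part of the library.

```python
def unet_subfolder(model_type: str) -> str:
    """Return the repo subfolder holding the UNet weights for a variant."""
    subfolders = {
        "determ": "unet_determ",  # deterministic: one fixed prediction per input
        "diff": "unet_diff",      # diffusion: sampled, diverse predictions
    }
    try:
        return subfolders[model_type]
    except KeyError:
        raise ValueError(
            f"unknown model_type {model_type!r}; expected one of {sorted(subfolders)}"
        ) from None
```

The result can be passed as the `subfolder` argument of `from_pretrained`. Unlike the inline conditional, which falls back to `unet_determ` for any value other than `'diff'`, this version rejects typos.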
 ## Citation
 ```bibtex
+@inproceedings{zhu2025motioncrafter,
+  title={MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE},
+  author={Zhu, Ruijie and Lu, Jiahao and Hu, Wenbo and Han, Xiaoguang and Cai, Jianfei and Shan, Ying and Zheng, Chuanxia},
+  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
+  year={2025}
+}
 ```
 ## License
+This model is provided under the Tencent License. See [LICENSE.txt](LICENSE.txt) for details.
 ## Acknowledgments
 This work builds upon [GeometryCrafter](https://github.com/TencentARC/GeometryCrafter). We thank the authors for their excellent contributions.