| --- |
| license: apache-2.0 |
| library_name: diffusers |
| pipeline_tag: image-to-video |
| --- |
| |
|
|
| <div align="center"> |
|
|
| <h2><a href="https://www.arxiv.org/abs/2505.10238">MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation</a></h2> |
|
|
| > Official project page of **MTVCrafter**, a novel framework for general and high-quality human image animation using raw 3D motion sequences. |
|
|
| <!-- |
| [Yanbo Ding](https://github.com/DINGYANB), |
| [Shaobin Zhuang](https://scholar.google.com/citations?user=PGaDirMAAAAJ&hl=zh-CN&oi=ao), |
| [Kunchang Li](https://scholar.google.com/citations?user=D4tLSbsAAAAJ), |
| [Zhengrong Yue](https://arxiv.org/search/?searchtype=author&query=Zhengrong%20Yue), |
| [Yu Qiao](https://scholar.google.com/citations?user=gFtI-8QAAAAJ&hl), |
| [Yali Wang†](https://scholar.google.com/citations?user=hD948dkAAAAJ) |
| --> |
|
|
| [Project Page](https://dingyanb.github.io/MTVCtafter/) | |
| [ArXiv](https://arxiv.org/abs/2505.10238) | |
| 💻 [Code](https://github.com/DINGYANB/MTVCrafter) | |
| 🤗 [Hugging Face Model](https://huggingface.co/yanboding/MTVCrafter) |
|
|
| </div> |
|
|
|
|
| ## Abstract |
|
|
| Human image animation has attracted increasing attention and developed rapidly due to its broad applications in digital humans. However, existing methods rely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 3D information. |
| To tackle these problems, we propose **MTVCrafter (Motion Tokenization Video Crafter)**, the first framework that directly models raw 3D motion sequences for open-world human image animation beyond intermediate 2D representations. |
|
|
| - We introduce **4DMoT (4D motion tokenizer)** to encode raw motion data into discrete motion tokens that preserve compact yet expressive 4D spatio-temporal information. |
| - Then, we propose **MV-DiT (Motion-aware Video DiT)**, which integrates a motion attention module and 4D positional encodings to effectively modulate vision tokens with motion tokens. |
| - The overall pipeline facilitates high-quality human video generation guided by 4D motion tokens. |
|
|
| MTVCrafter achieves **state-of-the-art results with an FID-VID of 6.98**, outperforming the second-best method by approximately **65%**. It generalizes well to diverse characters (single/multiple, full/half-body) across various styles. |
|
|
| ## 🎯 Motivation |
|
|
|  |
|
|
| Our motivation is that directly tokenizing 4D motion captures more faithful and expressive information than traditional 2D-rendered pose images derived from the driving video. |
|
|
| ## 💡 Method |
|
|
|  |
|
|
| *(1) 4DMoT*: |
| Our 4D motion tokenizer consists of an encoder-decoder framework to learn spatio-temporal latent representations of SMPL motion sequences, |
| and a vector quantizer to learn discrete tokens in a unified space. |
| All operations are performed in 2D space along the frame and joint axes. |
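
The vector-quantization step can be illustrated with a minimal sketch. It assumes the encoder has already produced one latent vector per (frame, joint) cell and simply maps each vector to the index of its nearest codebook entry. The function names, the toy codebook, and the 2-dimensional latents are hypothetical and chosen only for illustration; the 4DMoT implementation in the repository may differ.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latent: Vector, codebook: Sequence[Vector]) -> Tuple[int, Vector]:
    """Return (index, code vector) of the codebook entry nearest to `latent`."""
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(latent, codebook[i]))
    return index, codebook[index]


def tokenize_motion(latents: Sequence[Sequence[Vector]],
                    codebook: Sequence[Vector]) -> List[List[int]]:
    """Map a [frames][joints] grid of latent vectors to a grid of token ids."""
    return [[quantize(vec, codebook)[0] for vec in frame] for frame in latents]


# Toy data (hypothetical): 3 codes, 2 frames, 2 joints, 2-dim latents.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
latents = [
    [(0.1, -0.1), (0.9, 0.2)],  # frame 0
    [(0.2, 0.8), (0.6, 0.1)],   # frame 1
]
tokens = tokenize_motion(latents, codebook)
print(tokens)  # [[0, 1], [2, 1]]
```

The resulting grid of token ids keeps the frame and joint layout of the input, which is what lets positional information be attached to each motion token later on.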
|
|
|  |
|
|
| *(2) MV-DiT*: |
| Based on a video DiT architecture, |
| we design a 4D motion attention module to combine motion tokens with vision tokens. |
| Since tokenization and flattening disrupt positional information, |
| we introduce 4D RoPE to recover the spatio-temporal relationships. |
| To further improve generation quality and generalization, |
| we use learnable unconditional tokens for motion classifier-free guidance. |
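
Motion classifier-free guidance can be sketched as follows. This is the standard classifier-free guidance recipe under stated assumptions, not the repository's code: during training the motion tokens are occasionally swapped for the unconditional tokens, and at inference the conditional and unconditional predictions are blended with a guidance scale. The function names, the drop probability, and the plain-list representation are hypothetical.

```python
import random
from typing import List, Sequence


def drop_motion_condition(motion_tokens: Sequence[float],
                          uncond_tokens: Sequence[float],
                          drop_prob: float,
                          rng: random.Random) -> List[float]:
    """Training-time step: with probability `drop_prob`, replace the motion
    tokens with the (learnable) unconditional tokens."""
    if rng.random() < drop_prob:
        return list(uncond_tokens)
    return list(motion_tokens)


def guided_prediction(cond_pred: Sequence[float],
                      uncond_pred: Sequence[float],
                      guidance_scale: float) -> List[float]:
    """Inference-time step: uncond + scale * (cond - uncond), element-wise."""
    return [u + guidance_scale * (c - u)
            for c, u in zip(cond_pred, uncond_pred)]


# Toy predictions (hypothetical values).
cond = [2.0, 4.0]
uncond = [1.0, 1.0]
print(guided_prediction(cond, uncond, 3.0))  # [4.0, 10.0]
```

A scale of 1 reproduces the conditional prediction and a scale of 0 the unconditional one; larger scales push the output further toward the motion condition.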
|
|
| --- |
|
|
| ## 🛠️ Installation |
|
|
| We recommend using a clean Python environment (Python 3.10+). |
|
|
| ```bash |
| # Clone the repository and enter it |
| git clone https://github.com/DINGYANB/MTVCrafter.git && cd MTVCrafter |
| |
| # Create virtual environment |
| conda create -n mtvcrafter python=3.11 |
| conda activate mtvcrafter |
| |
| # Install dependencies |
| pip install -r requirements.txt |
| ``` |
|
|
| ## Usage |
|
|
| To animate a human image with a given 3D motion sequence, |
| first extract the SMPL motion sequences from the driving video: |
|
|
| ```bash |
| python process_nlf.py "your_video_directory" |
| ``` |
|
|
| Then, run the following command to animate the image guided by 4D motion tokens: |
|
|
| ```bash |
| python infer.py --ref_image_path "ref_images/human.png" --motion_data_path "data/sample_data.pkl" --output_path "inference_output" |
| ``` |
|
|
| - `--ref_image_path`: Path to the reference character image. |
| - `--motion_data_path`: Path to the motion sequence (.pkl format). |
| - `--output_path`: Directory in which to save the generated animation results. |
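
To animate several reference images with the same motion sequence, the command can be wrapped in a small script. The sketch below is an assumption-laden convenience wrapper, not part of the repository: it uses only the three flags listed above, assumes the reference images are `.png` files in one directory, and assumes a separate output subdirectory per image is acceptable. The helper names `build_infer_command` and `animate_all` are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def build_infer_command(ref_image, motion_data, output_dir) -> List[str]:
    """Assemble one `infer.py` call using the three documented flags."""
    return [
        "python", "infer.py",
        "--ref_image_path", str(ref_image),
        "--motion_data_path", str(motion_data),
        "--output_path", str(output_dir),
    ]


def animate_all(ref_image_dir, motion_data, output_root,
                dry_run: bool = True) -> List[List[str]]:
    """Build (and optionally run) one command per PNG in `ref_image_dir`.

    Each image gets its own output subdirectory named after the file stem;
    this layout is an assumption made for the sketch.
    """
    commands = []
    for image in sorted(Path(ref_image_dir).glob("*.png")):
        cmd = build_infer_command(image, motion_data,
                                  Path(output_root) / image.stem)
        commands.append(cmd)
        if not dry_run:
            # Run from the repository root so that infer.py is found.
            subprocess.run(cmd, check=True)
    return commands
```

With `dry_run=True` (the default) the function only returns the commands, which makes it easy to review them before launching any generation.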
|
|
| To train 4DMoT on your own dataset, run the following command: |
|
|
| ```bash |
| accelerate launch train_vqvae.py |
| ``` |
|
|
| ## Citation |
|
|
| If you find our work useful, please consider citing: |
|
|
| ```bibtex |
| @misc{ding2025mtvcrafter4dmotiontokenization, |
| title={MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation}, |
| author={Yanbo Ding and Xirui Hu and Zhizhi Guo and Yali Wang}, |
| year={2025}, |
| eprint={2505.10238}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.CV}, |
| url={https://arxiv.org/abs/2505.10238}, |
| } |
| ``` |
|
|
| ## 💬 Contact |
|
|
| For questions or collaboration, feel free to reach out via GitHub Issues |
| or email me at 📧 yb.ding@siat.ac.cn. |