	---
	license: apache-2.0
	library_name: diffusers
	pipeline_tag: image-to-video
	---

	<meta name="google-site-verification" content="-XQC-POJtlDPD3i2KSOxbFkSBde_Uq9obAIh_4mxTkM" />

	<div align="center">

	<h2><a href="https://www.arxiv.org/abs/2505.10238">MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation</a></h2>

	> Official project page of MTVCrafter, a novel framework for general and high-quality human image animation using raw 3D motion sequences.

	<!--
	[Yanbo Ding](https://github.com/DINGYANB),
	[Shaobin Zhuang](https://scholar.google.com/citations?user=PGaDirMAAAAJ&hl=zh-CN&oi=ao),
	[Kunchang Li](https://scholar.google.com/citations?user=D4tLSbsAAAAJ),
	[Zhengrong Yue](https://arxiv.org/search/?searchtype=author&query=Zhengrong%20Yue),
	[Yu Qiao](https://scholar.google.com/citations?user=gFtI-8QAAAAJ&hl),
	[Yali Wang†](https://scholar.google.com/citations?user=hD948dkAAAAJ)
	-->

	🔗 [Project Page](https://dingyanb.github.io/MTVCtafter/) \|
	📄 [ArXiv](https://arxiv.org/abs/2505.10238) \|
	💻 [Code](https://github.com/DINGYANB/MTVCrafter) \|
	🤗 [Hugging Face Model](https://huggingface.co/yanboding/MTVCrafter)

	</div>


	## 🔍 Abstract

	Human image animation has attracted increasing attention and developed rapidly due to its broad applications in digital humans. However, existing methods rely on 2D-rendered pose images for motion guidance, which limits generalization and discards essential 3D information.
	To tackle these problems, we propose MTVCrafter (Motion Tokenization Video Crafter), the first framework that directly models raw 3D motion sequences for open-world human image animation beyond intermediate 2D representations.

	- We introduce 4DMoT (4D motion tokenizer) to encode raw motion data into discrete motion tokens that preserve compact yet expressive 4D spatio-temporal information.
	- Then, we propose MV-DiT (Motion-aware Video DiT), which integrates a motion attention module and 4D positional encodings to effectively modulate vision tokens with motion tokens.
	- The overall pipeline facilitates high-quality human video generation guided by 4D motion tokens.

	MTVCrafter achieves state-of-the-art results with an FID-VID of 6.98, outperforming the second-best by approximately 65%. It generalizes well to diverse characters (single/multiple, full/half-body) across various styles.

	## 🎯 Motivation

	![Motivation](./static/images/Motivation.png)

	Our motivation is that directly tokenizing 4D motion captures more faithful and expressive information than the traditional 2D-rendered pose images derived from the driving video.

	## 💡 Method

	![Method](./static/images/4DMoT.png)

	(1) 4DMoT:
	Our 4D motion tokenizer consists of an encoder-decoder framework to learn spatio-temporal latent representations of SMPL motion sequences,
	and a vector quantizer to learn discrete tokens in a unified space.
	All operations are performed in 2D space along frame and joint axes.
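The quantization step described above can be illustrated with a minimal, dependency-free sketch. It is not the released 4DMoT code: the names `nearest_code` and `tokenize`, the `[frames][joints][dim]` layout, and the toy codebook are assumptions made for illustration, and the real model learns its codebook jointly with the encoder and decoder.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def tokenize(latents, codebook):
    """Map latents laid out as [frames][joints][dim] to token ids [frames][joints]."""
    return [[nearest_code(v, codebook) for v in frame] for frame in latents]


# Toy data (hypothetical): 3 codes of dimension 2, 2 frames x 2 joints of latents.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
latents = [
    [[0.1, -0.1], [0.9, 0.2]],
    [[0.2, 0.8], [0.6, 0.1]],
]

tokens = tokenize(latents, codebook)
print(tokens)  # [[0, 1], [2, 1]]

# Decoding starts from the looked-up code vectors rather than the raw latents.
quantized = [[codebook[i] for i in frame] for frame in tokens]
```

Each (frame, joint) position becomes one integer token, so the token grid keeps the frame and joint axes that the encoder operates on.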

	![Method](./static/images/MV-DiT.png)

	(2) MV-DiT:
	Built on a video DiT architecture,
	we design a 4D motion attention module to combine motion tokens with vision tokens.
	Because tokenization and flattening disrupt positional information,
	we introduce 4D RoPE to recover the spatio-temporal relationships.
	To further improve generation quality and generalization,
	we use learnable unconditional tokens for motion classifier-free guidance.
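For readers unfamiliar with classifier-free guidance, the sketch below shows the standard way a conditional and an unconditional prediction are combined at sampling time. It assumes MV-DiT follows this common formulation; the function name `guided_prediction` and the flat-list inputs are hypothetical stand-ins. In the model, the unconditional branch would be the one fed the learnable unconditional motion tokens.

```python
def guided_prediction(cond, uncond, scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and, for scale > 1, past) the motion-conditioned one."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy predictions (hypothetical values).
cond = [1.0, 2.0]     # prediction given the real motion tokens
uncond = [0.0, 1.0]   # prediction given the unconditional tokens

print(guided_prediction(cond, uncond, 3.0))  # [3.0, 4.0]
```

A scale of 1 reproduces the conditional prediction and a scale of 0 reproduces the unconditional one; larger values push the result more strongly toward the motion condition.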

	---

	## 🛠️ Installation

	We recommend using a clean Python environment (Python 3.10+).

	```bash
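	# Clone the repository and enter it
	# (a ZIP download from GitHub extracts to MTVCrafter-main instead)
	git clone https://github.com/DINGYANB/MTVCrafter.git
	cd MTVCrafter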

	# Create virtual environment
	conda create -n mtvcrafter python=3.11
	conda activate mtvcrafter

	# Install dependencies
	pip install -r requirements.txt
	```

	## 🚀 Usage

	To animate a human image with a given 3D motion sequence,
	you first need to extract the SMPL motion sequences from the driving video:

	```bash
	python process_nlf.py "your_video_directory"
	```

	Then, you can use the following command to animate the image guided by 4D motion tokens:

	```bash
	python infer.py --ref_image_path "ref_images/hunam.png" --motion_data_path "data/sample_data.pkl" --output_path "inference_output"
	```

	- `--ref_image_path`: Path to the reference character image.
	- `--motion_data_path`: Path to the motion sequence (.pkl format).
	- `--output_path`: Where to save the generated animation results.
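The layout of the motion `.pkl` file is not documented here, so it can help to inspect a file before passing it to `infer.py`. The helper below is a hypothetical sketch, not part of the repository: it makes no assumption about the file's keys and only reports container types, lengths, and any `.shape` attribute it finds.

```python
import pickle


def describe(obj, depth=0, max_depth=2):
    """Return indented lines describing the type, length, and shape of nested data."""
    pad = "  " * depth
    shape = getattr(obj, "shape", None)  # arrays and tensors expose .shape
    if shape is not None:
        return [f"{pad}{type(obj).__name__} shape={tuple(shape)}"]
    if isinstance(obj, dict):
        lines = [f"{pad}dict ({len(obj)} keys)"]
        if depth < max_depth:
            for key, value in obj.items():
                lines.append(f"{pad}  {key!r}:")
                lines += describe(value, depth + 2, max_depth)
        return lines
    if isinstance(obj, (list, tuple)):
        lines = [f"{pad}{type(obj).__name__} (len {len(obj)})"]
        if obj and depth < max_depth:
            lines += describe(obj[0], depth + 1, max_depth)  # first element only
        return lines
    return [f"{pad}{type(obj).__name__}"]


def describe_pickle(path):
    """Load a pickle file and describe its contents. Only open files you trust."""
    with open(path, "rb") as f:
        return describe(pickle.load(f))
```

For example, `print("\n".join(describe_pickle("data/sample_data.pkl")))` prints a short outline of the sample file. Unpickling can execute arbitrary code, so use this only on files from a trusted source, and note that any libraries whose objects are stored in the file must be installed for loading to succeed.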

	To train 4DMoT on your own dataset, run:

	```bash
	accelerate launch train_vqvae.py
	```

	## 📄 Citation

	If you find our work useful, please consider citing:

	```bibtex
	@misc{ding2025mtvcrafter4dmotiontokenization,
	title={MTVCrafter: 4D Motion Tokenization for Open-World Human Image Animation},
	author={Yanbo Ding and Xirui Hu and Zhizhi Guo and Yali Wang},
	year={2025},
	eprint={2505.10238},
	archivePrefix={arXiv},
	primaryClass={cs.CV},
	url={https://arxiv.org/abs/2505.10238},
	}
	```

	## 📬 Contact

	For questions or collaboration, feel free to reach out via GitHub Issues
	or email me at 📧 yb.ding@siat.ac.cn.