	---
	base_model:
	- Wan-AI/Wan2.1-I2V-14B-720P
	library_name: diffusers
	license: apache-2.0
	pipeline_tag: image-to-video
	tags:
	- video-generation
	- video diffusion transformer
	- audio-driven avatar animation
	task_categories:
	- image-to-video
	- text-to-video
	---

	# FlashPortrait

	<a href='https://francis-rings.github.io/FlashPortrait'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/2512.16900'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://github.com/Francis-Rings/FlashPortrait'><img src='https://img.shields.io/badge/GitHub-Code-blue?logo=github'></a> <a href='https://www.youtube.com/watch?v=woSzRXlXyiY'><img src='https://img.shields.io/badge/YouTube-Watch-red?style=flat-square&logo=youtube'></a> <a href='https://www.bilibili.com/video/BV1Gfq9BAEvo'><img src='https://img.shields.io/badge/Bilibili-Watch-blue?style=flat-square&logo=bilibili'></a>

	FlashPortrait: 6$\times$ Faster Infinite Portrait Animation with Adaptive Latent Prediction
	<br/>
	Shuyuan Tu<sup>1</sup>, Yueming Pan<sup>3</sup>, Yinming Huang<sup>1</sup>, Xintong Han<sup>4</sup>, Zhen Xing<sup>5</sup>, Qi Dai<sup>2</sup>, Kai Qiu<sup>2</sup>, Chong Luo<sup>2</sup>, Zuxuan Wu<sup>1</sup>
	<br/>
	[<sup>1</sup>Fudan University; <sup>2</sup>Microsoft Research Asia; <sup>3</sup>Xi'an Jiaotong University; <sup>4</sup>Tencent Inc; <sup>5</sup>Wan Team, Tongyi Lab, Alibaba Group]



	<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
	<tr>
	<td>
	<video src="https://github.com/user-attachments/assets/f052b880-28b5-4a59-8100-77318a9e8425" width="320" controls loop></video>
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/b698d2e7-4c90-4e95-b24f-38b53514470b" width="320" controls loop></video>
	</td>
	</tr>
	<tr>
	<td>
	<video src="https://github.com/user-attachments/assets/58f4a67f-8f1f-401c-90e2-50479bf81dfb" width="320" controls loop></video>
	</td>
	<td>
	<video src="https://github.com/user-attachments/assets/894fe221-fb09-4422-aa8f-46ce31edf1b4" width="320" controls loop></video>
	</td>
	</tr>
	</table>

	<p style="text-align: justify;">
	<span>Portrait animations generated by FlashPortrait, showing its power to synthesize <b>infinite-length ID-preserving animations</b>. All videos are <b>directly synthesized by FlashPortrait without the use of any face-related post-processing tools</b>, such as the face-swapping tool FaceFusion or face restoration models like GFP-GAN and CodeFormer.</span>
	</p>


	<p align="center">
	<video src="https://github.com/user-attachments/assets/20f34576-0689-4be2-99b1-aee550f07641" width="768" autoplay loop muted playsinline></video>
	<video src="https://github.com/user-attachments/assets/fb63eb2d-c8bb-49a4-bac1-ab8ef2c96841" width="768" autoplay loop muted playsinline></video>
	<br/>
	<span>Comparison results between FlashPortrait and state-of-the-art (SOTA) portrait animation models highlight the superior performance of FlashPortrait in delivering <b>infinite-length, high-fidelity, identity-preserving portrait animation</b>.</span>
	</p>


	## Overview

	<p align="center">
	<img src="assets/figures/framework.jpg" alt="model architecture" width="1280"/>
	<br/>
	<i>The overview of the framework of FlashPortrait.</i>
	</p>

	Current diffusion-based acceleration methods for long-portrait animation struggle to ensure identity (ID) consistency. This paper presents FlashPortrait, an end-to-end video diffusion transformer capable of synthesizing ID-preserving, infinite-length videos while achieving up to 6$\times$ acceleration in inference speed.
	In particular, FlashPortrait begins by computing the identity-agnostic facial expression features with an off-the-shelf extractor.
	It then introduces a Normalized Facial Expression Block to align facial features with diffusion latents by normalizing them with their respective means and variances, thereby improving identity stability in facial modeling.
	During inference, FlashPortrait adopts a dynamic sliding-window scheme with weighted blending in overlapping areas, ensuring smooth transitions and ID consistency in long animations. In each context window, based on the latent variation rate at particular timesteps and the derivative magnitude ratio among diffusion layers, FlashPortrait
	utilizes higher-order latent derivatives at the current timestep to directly predict latents at future timesteps, thereby skipping several denoising steps and achieving 6$\times$ speed acceleration.
	Experiments on benchmarks show the effectiveness of FlashPortrait both qualitatively and quantitatively.
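The idea of predicting future latents from higher-order derivatives can be illustrated with a small extrapolation sketch. This is not the released Adaptive Latent Prediction code: it assumes evenly spaced denoising steps, represents a latent as a flat list of floats, estimates derivatives with backward finite differences, and omits the adaptive criteria (latent variation rate and per-layer derivative magnitude ratio) described above. The function names are hypothetical.

```python
from math import comb


def backward_differences(history, order):
    """Return [x_t, d1 x_t, d2 x_t, ...] from the most recent latents.

    `history` lists latents from oldest to newest at evenly spaced steps.
    """
    if len(history) < order + 1:
        raise ValueError("need at least order + 1 past latents")
    newest_first = history[::-1]
    size = len(newest_first[0])
    diffs = []
    for k in range(order + 1):
        # k-th backward difference: sum_j (-1)^j * C(k, j) * x_{t-j}
        diffs.append([
            sum((-1) ** j * comb(k, j) * newest_first[j][i] for j in range(k + 1))
            for i in range(size)
        ])
    return diffs


def predict_latent(history, steps_ahead, order=2):
    """Extrapolate the latent `steps_ahead` steps past the newest one."""
    diffs = backward_differences(history, order)
    prediction = [0.0] * len(diffs[0])
    coeff = 1.0  # s (s + 1) ... (s + k - 1) / k! for the k-th difference
    for k, diff in enumerate(diffs):
        if k > 0:
            coeff *= (steps_ahead + k - 1) / k
        for i, value in enumerate(diff):
            prediction[i] += coeff * value
    return prediction
```

With `order=2` the extrapolation is exact for latents that evolve quadratically across steps; whenever a prediction is accepted, the denoising steps it covers can be skipped, which is where the speedup would come from.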

	## News
	* `[2025-12-15]`: 🔥 The project page, code, technical report, and [a basic model checkpoint](https://huggingface.co/FrancisRing/FlashPortrait/tree/main) are released. The further acceleration component (Adaptive Latent Prediction) will be released very soon. Stay tuned!

	## 🛠️ To-Do List
	- [x] FlashPortrait-14B
	- [x] Inference Code
	- [x] Training Code
	- [ ] Multiple-GPU Inference Code
	- [ ] Inference Code with Adaptive Latent Prediction

	## 🔑 Quickstart

	FlashPortrait supports generating <b>infinite-length videos at 480x832, 832x480, 512x512, 720x720, 720x1280, or 1280x720 resolution</b>. If you run out of GPU memory, reduce the number of animated frames or the output resolution.

	### 🧱 Environment setup

	```
	pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
	pip install -r requirements.txt
	# Optional: install flash_attn to accelerate attention computation
	pip install flash_attn
	```

	### 🧱 Download weights
	If you encounter connection issues with Hugging Face, you can utilize the mirror endpoint by setting the environment variable: `export HF_ENDPOINT=https://hf-mirror.com`.
	Please download weights manually as follows:
	```
	pip install "huggingface_hub[cli]"
	cd FlashPortrait
	mkdir checkpoints
	huggingface-cli download FrancisRing/FlashPortrait --local-dir ./checkpoints/FlashPortrait
	huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P --local-dir ./checkpoints/Wan2.1-I2V-14B-720P
	```
	The weights and the overall project file structure should be organized as follows:
	```
	FlashPortrait/
	├── config
	├── examples
	├── wan
	├── checkpoints
	│ ├── FlashPortrait
	│ └── Wan2.1-I2V-14B-720P
	├── infer.py
	├── fast_infer.py
	├── train_portrait.py
	├── bin_convert_pt.py
	├── train_single_machine.sh
	├── train_multiple_machine.sh
	└── requirements.txt
	```
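Before running inference, a quick check that the layout above is in place can save a failed launch. The following is a minimal sketch; the helper name `missing_entries` and the choice of which entries to check are assumptions, not part of the repository.

```python
from pathlib import Path

# Entries taken from the file structure listed above.
EXPECTED = (
    "checkpoints/FlashPortrait",
    "checkpoints/Wan2.1-I2V-14B-720P",
    "infer.py",
    "fast_infer.py",
)


def missing_entries(repo_root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries(".")
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Layout looks complete.")
```

Run it from the `FlashPortrait/` directory; an empty result means both checkpoint folders and the inference scripts were found.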

	### 🧱 Model inference
	Sample configurations for testing are provided in `infer.py` and `fast_infer.py`. You can modify them according to your needs.

	```
	bash inference.sh
	```
	Wan2.1-14B-based FlashPortrait supports video-driven portrait video generation at the following resolutions: 512x512, 480x832, 832x480, 720x720, 720x1280, and 1280x720. You can modify "max_size" in `infer.py` to set the resolution of the animation. "--validation_image_start", "--validation_driven_video_path", and "--prompt" in `infer.py` refer to the path of the reference image, the path of the driving video, and the text prompt, respectively.
	Prompts are also very important. The recommended format is `[Description of first frame]-[Description of human behavior]-[Description of background (optional)]`.
	"--wan_model_name", "--transformer_path", and "--portrait_encoder_path" in `infer.py` are the paths of the pretrained Wan2.1-14B weights, the pretrained FlashPortrait DiT weights, and the pretrained FlashPortrait Portrait Encoder weights, respectively.
	"--num_inference_steps", "--sub_num_frames", "--latents_num_frames", "--context_overlap", and "--context_size" refer to the total number of inference steps, the number of RGB frames synthesized per batch, the number of latent frames synthesized per batch, the overlap length between two adjacent context windows, and the number of latent frames in a context window, respectively.
	Notably, the recommended `--num_inference_steps` range is [30-50]; more steps yield higher quality. The recommended `--context_overlap` range is [10-40]; a longer overlap yields higher quality but slower inference.
	"--text_cfg_scale" and "--emo_cfg_scale" are the classifier-free guidance (CFG) scales for the text prompt and the portrait emotion. The recommended range for both the text and emotion CFG scales is `[2-5]`. You can increase the emotion CFG scale to improve emotion synchronization with the driving video.

	We provide 6 cases in different resolution settings in `path/FlashPortrait/examples` for validation. ❤️❤️Please feel free to try it out and enjoy the endless entertainment of infinite-length portrait video generation❤️❤️!
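To make the roles of `--context_size` and `--context_overlap` concrete, here is a minimal sketch of overlapping context windows with weighted blending. It is an illustration under assumptions rather than the scheduling code in `infer.py`: the function names are hypothetical, the blending weights are a simple linear ramp, and each latent frame is reduced to a single float.

```python
def context_windows(num_frames, context_size, context_overlap):
    """Return (start, end) index pairs covering num_frames latent frames."""
    if not 0 <= context_overlap < context_size:
        raise ValueError("context_overlap must be in [0, context_size)")
    if num_frames <= context_size:
        return [(0, num_frames)]
    stride = context_size - context_overlap
    windows = []
    start = 0
    while start + context_size < num_frames:
        windows.append((start, start + context_size))
        start += stride
    # The final window is shifted back so it ends exactly at the last frame.
    windows.append((num_frames - context_size, num_frames))
    return windows


def window_weights(length, overlap):
    """Linear ramp up over the leading frames and down over the trailing ones."""
    return [
        min(1.0, (i + 1) / (overlap + 1), (length - i) / (overlap + 1))
        for i in range(length)
    ]


def blend_windows(num_frames, windows, window_values, overlap):
    """Weighted average of per-window predictions for every frame index."""
    total = [0.0] * num_frames
    norm = [0.0] * num_frames
    for (start, end), values in zip(windows, window_values):
        weights = window_weights(end - start, overlap)
        for idx, weight, value in zip(range(start, end), weights, values):
            total[idx] += weight * value
            norm[idx] += weight
    return [t / n for t, n in zip(total, norm)]
```

A larger overlap means more windows per video, so more denoising work, and more frames averaged across neighboring windows, which is consistent with the quality and speed trade-off noted for `--context_overlap`.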

	#### 💡Tips
	- `fast_infer.py` runs faster than `infer.py` and uses the same configuration settings.

	- If you have limited GPU resources, you can change the loading mode of FlashPortrait by modifying "--GPU_memory_mode" in `infer.py`. The options of "--GPU_memory_mode" are `model_full_load`, `sequential_cpu_offload`, `model_cpu_offload_and_qfloat8`, and `model_cpu_offload`. In particular, when you set `--GPU_memory_mode` to `sequential_cpu_offload`, the total GPU memory consumption is approximately 10GB, at the cost of slower inference.
	Setting `--GPU_memory_mode` to `model_cpu_offload` can significantly cut GPU memory usage, reducing it by roughly half compared to `model_full_load` mode.

	- A higher resolution setting results in higher-quality synthesized videos (480p->720p).

	### 🧱 Model Training
	<b>🔥🔥It’s worth noting that if you’re looking to train a conditioned Video Diffusion Transformer (DiT) model, such as Wan2.1, this training tutorial will also be helpful.🔥🔥</b>
	The training dataset must be organized as follows:

	```
	poirtrait_data/
	├── rec
	│ │ ├──speech
	│ │ │ ├──00001
	│ │ │ │ ├──images
	│ │ │ │ │ ├──frame_0.png
	│ │ │ │ │ ├──frame_1.png
	│ │ │ │ │ ├──frame_2.png
	│ │ │ │ │ ├──...
	│ │ │ │ ├──face_masks
	│ │ │ │ │ ├──frame_0.png
	│ │ │ │ │ ├──frame_1.png
	│ │ │ │ │ ├──frame_2.png
	│ │ │ │ │ ├──...
	│ │ │ │ ├──lip_masks
	│ │ │ │ │ ├──frame_0.png
	│ │ │ │ │ ├──frame_1.png
	│ │ │ │ │ ├──frame_2.png
	│ │ │ │ │ ├──...
	│ │ │ ├──00002
	│ │ │ │ ├──images
	│ │ │ │ ├──face_masks
	│ │ │ │ ├──lip_masks
	│ │ │ └──...
	│ │ ├──singing
	│ │ │ ├──00001
	│ │ │ │ ├──images
	│ │ │ │ ├──face_masks
	│ │ │ │ ├──lip_masks
	│ │ │ └──...
	│ │ ├──dancing
	│ │ │ ├──00001
	│ │ │ │ ├──images
	│ │ │ │ ├──face_masks
	│ │ │ │ ├──lip_masks
	│ │ │ └──...
	├── vec
	│ │ ├──speech
	│ │ │ ├──00001
	│ │ │ │ ├──images
	│ │ │ │ ├──face_masks
	│ │ │ │ ├──lip_masks
	│ │ │ └──...
	│ │ ├──singing
	│ │ │ ├──00001
	│ │ │ │ ├──images
	│ │ │ │ ├──face_masks
	│ │ │ │ ├──lip_masks
	│ │ │ └──...
	│ │ ├──dancing
	│ │ │ ├──00001
	│ │ │ │ ├──images
	│ │ │ │ ├──face_masks
	│ │ │ │ ├──lip_masks
	│ │ │ └──...
	├── square
	│ │ ├──speech
	│ │ │ ├──00001
	│ │ │ │ ├──images
	│ │ │ │ ├──face_masks
	│ │ │ │ ├──lip_masks
	│ │ │ └──...
	│ │ ├──singing
	│ │ │ ├──00001
	│ │ │ │ ├──images
	│ │ │ │ ├──face_masks
	│ │ │ │ ├──lip_masks
	│ │ │ └──...
	│ │ ├──dancing
	│ │ │ ├──00001
	│ │ │ │ ├──images
	│ │ │ │ ├──face_masks
	│ │ │ │ ├──lip_masks
	│ │ │ └──...
	├── video_rec_path.txt
	├── video_square_path.txt
	└── video_vec_path.txt
	```
	FlashPortrait is trained on mixed-resolution videos, with 720x720 videos stored in `poirtrait_data/square`, 480x832 videos stored in `poirtrait_data/vec`, and 832x480 videos stored in `poirtrait_data/rec`. Each of `poirtrait_data/square`, `poirtrait_data/rec`, and `poirtrait_data/vec` contains three subfolders, one per video type (speech, singing, and dancing).
	All `.png` image files are named in the format `frame_i.png`, such as `frame_0.png`, `frame_1.png`, and so on.
	`00001`, `00002`, `00003`, and so on each correspond to an individual video.
	Within each video folder, the three subfolders `images`, `face_masks`, and `lip_masks` store the RGB frames, the corresponding human face masks, and the corresponding human lip masks, respectively.
	`video_square_path.txt`, `video_rec_path.txt`, and `video_vec_path.txt` record the video folder paths under `poirtrait_data/square`, `poirtrait_data/rec`, and `poirtrait_data/vec`, respectively.
	For example, the content of `video_rec_path.txt` is shown as follows:
	```
	path/FlashPortrait/poirtrait_data/rec/speech/00001
	path/FlashPortrait/poirtrait_data/rec/speech/00002
	...
	path/FlashPortrait/poirtrait_data/rec/singing/00003
	path/FlashPortrait/poirtrait_data/rec/singing/00004
	...
	path/FlashPortrait/poirtrait_data/rec/dancing/00005
	path/FlashPortrait/poirtrait_data/rec/dancing/00006
	...
	```
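The three path files can be generated from the folder tree instead of being written by hand. The sketch below assumes the layout described above and writes absolute paths; whether the training scripts need absolute or relative paths is not stated here, so adjust as required. The function names are hypothetical.

```python
from pathlib import Path

CATEGORIES = ("speech", "singing", "dancing")


def list_clip_dirs(data_root, layout):
    """Collect clip folders such as <data_root>/rec/speech/00001."""
    clips = []
    for category in CATEGORIES:
        category_dir = Path(data_root) / layout / category
        if category_dir.is_dir():
            clips.extend(sorted(p for p in category_dir.iterdir() if p.is_dir()))
    return clips


def write_path_file(data_root, layout):
    """Write video_<layout>_path.txt with one clip folder per line."""
    out_file = Path(data_root) / f"video_{layout}_path.txt"
    lines = [str(p.resolve()) for p in list_clip_dirs(data_root, layout)]
    out_file.write_text("\n".join(lines) + ("\n" if lines else ""))
    return out_file
```

For example, calling `write_path_file("poirtrait_data", layout)` for each of `"rec"`, `"vec"`, and `"square"` produces the three files next to the data folders.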
	If you only have raw videos, you can leverage `ffmpeg` to extract frames from raw videos (speech) and store them in the subfolder `images`.
	```
	ffmpeg -i raw_video_1.mp4 -q:v 1 -start_number 0 path/FlashPortrait/poirtrait_data/rec/speech/00001/images/frame_%d.png
	```
	The obtained frames are saved in `path/FlashPortrait/poirtrait_data/rec/speech/00001/images`.
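When many raw videos need to be processed, the same `ffmpeg` invocation can be scripted. This is a minimal sketch that reuses the flags from the command above; the helper names are hypothetical, and `ffmpeg` must be available on `PATH`.

```python
import subprocess
from pathlib import Path


def ffmpeg_command(video_path, clip_dir):
    """Build the frame-extraction command for one raw video."""
    images_dir = Path(clip_dir) / "images"
    return [
        "ffmpeg", "-i", str(video_path),
        "-q:v", "1", "-start_number", "0",
        str(images_dir / "frame_%d.png"),
    ]


def extract_frames(video_path, clip_dir):
    """Create <clip_dir>/images and extract frames into it with ffmpeg."""
    (Path(clip_dir) / "images").mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if ffmpeg exits with an error.
    subprocess.run(ffmpeg_command(video_path, clip_dir), check=True)
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces.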

	To extract the human face masks, please refer to the [StableAnimator repo](https://github.com/Francis-Rings/StableAnimator). The Human Face Mask Extraction section in its tutorial provides off-the-shelf code.
	To extract the human lip masks, please refer to the [StableAvatar repo](https://github.com/Francis-Rings/StableAvatar). The Human Lip Mask Extraction section in its tutorial provides off-the-shelf code.
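After extracting frames and masks, it can help to confirm that each clip folder is internally consistent before training. The sketch below assumes that every frame in `images` has a mask with the same file name in `face_masks` and `lip_masks`, as the folder tree suggests; the helper names are hypothetical.

```python
from pathlib import Path

SUBFOLDERS = ("images", "face_masks", "lip_masks")


def frame_names(folder):
    """Return the set of frame_*.png file names in a folder."""
    return {p.name for p in Path(folder).glob("frame_*.png")}


def check_clip(clip_dir):
    """Return a list of problems for one clip; an empty list means consistent."""
    clip = Path(clip_dir)
    problems = []
    names = {}
    for sub in SUBFOLDERS:
        if (clip / sub).is_dir():
            names[sub] = frame_names(clip / sub)
        else:
            problems.append(f"missing folder: {sub}")
    reference = names.get("images", set())
    for sub in ("face_masks", "lip_masks"):
        if sub in names and names[sub] != reference:
            # Symmetric difference: names present on only one side.
            mismatched = names[sub] ^ reference
            problems.append(f"{sub} differs from images in {len(mismatched)} frame(s)")
    return problems
```

Running `check_clip` over every line of the path files gives a quick report of clips that still need masks.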

	When your dataset is organized exactly as outlined above, you can easily train your Wan2.1-14B-based FlashPortrait by running the following command:
	```
	# Train FlashPortrait with mixed resolutions (480x832, 832x480, and 720x720) on a single node
	bash train_single_machine.sh
	# Train FlashPortrait with mixed resolutions (480x832, 832x480, and 720x720) on multiple nodes
	bash train_multiple_machine.sh
	```
	Regarding the parameters of `train_single_machine.sh` and `train_multiple_machine.sh`, `CUDA_VISIBLE_DEVICES` specifies the GPU devices to use. In my setting, I use four NVIDIA A100 80GB GPUs to train FlashPortrait (`CUDA_VISIBLE_DEVICES=3,2,1,0`) on a single node.
	`--pretrained_model_name_or_path` and `--output_dir` refer to the pretrained Wan2.1-14B path and the checkpoint saved path of the trained FlashPortrait.
	`--train_data_square_dir`, `--train_data_rec_dir`, and `--train_data_vec_dir` are the paths of `video_square_path.txt`, `video_rec_path.txt`, and `video_vec_path.txt`, respectively.
	`--video_sample_n_frames` is the number of frames that FlashPortrait processes in a single batch.
	`--num_train_epochs` is the training epoch number.

	Since we use DeepSpeed ZeRO Stage 3 to train FlashPortrait, the saved checkpoint needs to be converted to fp32 as follows:
	```
	cd output_14B_dir/checkpoint-x
	python zero_to_fp32.py /path/FlashPortrait/output_14B_dir/checkpoint-x /path/FlashPortrait/output_14B_dir/checkpoint-x-fp32-infer --max_shard_size 80GB
	cd ../..
	python bin_convert_pt.py --pretrained_model_path="/path/FlashPortrait/output_14B_dir/checkpoint-x-fp32-infer"
	```
	<b>It is worth noting that training FlashPortrait requires approximately 50GB of VRAM due to the mixed-resolution (480x832, 832x480, and 720x720) training pipeline.
	However, if you train FlashPortrait exclusively on 512x512 videos, the VRAM requirement is reduced to approximately 40GB.</b>
	Additionally, the backgrounds of the selected training videos should remain static, as this helps the diffusion model calculate an accurate reconstruction loss.


	### 🧱 Model Finetuning
	To fully finetune FlashPortrait, add `--transformer_path="path/FlashPortrait/checkpoints/FlashPortrait/transformer.pt"` and `--portrait_encoder_path="path/FlashPortrait/checkpoints/FlashPortrait/portrait_encoder.pt"` to `train_single_machine.sh` or `train_multiple_machine.sh`:
	```
	# Finetune FlashPortrait with mixed resolutions (480x832, 832x480, and 720x720) on a single node
	bash train_single_machine.sh
	# Finetune FlashPortrait with mixed resolutions (480x832, 832x480, and 720x720) on multiple nodes
	bash train_multiple_machine.sh
	```

	### 🧱 VRAM requirement

	For a 10s video (720x1280, fps=25), FlashPortrait with `--GPU_memory_mode="model_full_load"` requires approximately 60GB of VRAM on an A100 GPU, while `--GPU_memory_mode="sequential_cpu_offload"` requires approximately 10GB of VRAM.
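As a rough guide, the figures quoted in this README can be turned into a small helper that suggests a `--GPU_memory_mode` value from the free VRAM. The thresholds are assumptions derived from the approximate numbers for a 10s 720x1280 video (about 60GB for full load, roughly half of that for `model_cpu_offload`); they will differ for other resolutions and lengths, and `model_cpu_offload_and_qfloat8` is left out because no figure is given for it. The function name is hypothetical.

```python
def suggest_memory_mode(free_vram_gb, full_load_gb=60.0):
    """Suggest a --GPU_memory_mode value from the free VRAM in GB.

    Thresholds follow the approximate README figures and are not guarantees.
    """
    if free_vram_gb >= full_load_gb:
        return "model_full_load"
    if free_vram_gb >= full_load_gb / 2:
        # model_cpu_offload is described as roughly halving memory usage.
        return "model_cpu_offload"
    # Lowest-memory option (about 10GB), at the cost of slower inference.
    return "sequential_cpu_offload"
```

For instance, a 40GB card would be steered toward `model_cpu_offload`, while a 24GB card would fall back to `sequential_cpu_offload`.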

	<b>🔥🔥Theoretically, FlashPortrait is capable of synthesizing hours of video without significant quality degradation; however, the 3D VAE decoder demands significant GPU memory, especially when decoding 10k+ frames. You have the option to run the VAE on CPU.🔥🔥</b>

	### 🧱 Acknowledgments
	Thanks to [Wan2.1](https://github.com/Wan-Video/Wan2.1), [PD-FGC](https://github.com/Dorniwang/PD-FGC-inference), [FantasyPortrait](https://github.com/Fantasy-AMAP/fantasy-portrait) and [VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun) for open-sourcing their models and code, which provided valuable references and support for this project. Their contributions to the open-source community are truly appreciated.

	## Contact
	If you have any suggestions or find our work helpful, feel free to contact me.

	Email: francisshuyuan@gmail.com

	If you find our work useful, <b>please consider giving a star ⭐ to this GitHub repository and citing it ❤️</b>:
	```bib
	@article{tu2025flashportrait,
	title={FlashPortrait: 6$\times$ Faster Infinite Portrait Animation with Adaptive Latent Prediction},
	author={Tu, Shuyuan and Pan, Yueming and Huang, Yinming and Han, Xintong and Xing, Zhen and Dai, Qi and Qiu, Kai and Luo, Chong and Wu, Zuxuan},
	journal={arXiv preprint arXiv:2512.16900},
	year={2025}
	}
	```