FrancisRing committed on
Commit
dd3828a
·
verified ·
1 Parent(s): 38a67b0

Update README.md

Files changed (1)
  1. README.md +315 -3
README.md CHANGED
@@ -1,3 +1,315 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ base_model:
3
+ - Wan-AI/Wan2.1-I2V-14B-720P
4
+ library_name: diffusers
5
+ license: apache-2.0
6
+ pipeline_tag: image-to-video
7
+ tags:
8
+ - video-generation
9
+ - video diffusion transformer
10
+ - audio-driven avatar animation
11
+ task_categories:
12
+ - image-to-video
13
+ - text-to-video
14
+ ---
15
+
16
+ # FlashPortrait
17
+
18
+ <a href='https://francis-rings.github.io/FlashPortrait'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='https://arxiv.org/abs/'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a> <a href='https://huggingface.co/FrancisRing/FlashPortrait/tree/main'><img src='https://img.shields.io/badge/HuggingFace-Model-orange'></a> <a href='https://www.youtube.com/watch?v=6lhvmbzvv3Y'><img src='https://img.shields.io/badge/YouTube-Watch-red?style=flat-square&logo=youtube'></a> <a href='https://www.bilibili.com/video/BV1hUt9z4EoQ'><img src='https://img.shields.io/badge/Bilibili-Watch-blue?style=flat-square&logo=bilibili'></a>
19
+
20
+ FlashPortrait: 6$\times$ Faster Infinite Portrait Animation with Adaptive Latent Prediction
21
+ <br/>
22
+ Shuyuan Tu<sup>1</sup>, Yueming Pan<sup>3</sup>, Yinming Huang<sup>1</sup>, Xintong Han<sup>4</sup>, Zhen Xing<sup>5</sup>, Qi Dai<sup>2</sup>, Kai Qiu<sup>2</sup>, Chong Luo<sup>2</sup>, Zuxuan Wu<sup>1</sup>
23
+ <br/>
24
+ [<sup>1</sup>Fudan University; <sup>2</sup>Microsoft Research Asia; <sup>3</sup>Xi'an Jiaotong University; <sup>4</sup>Tencent Inc; <sup>5</sup>Wan Team, Tongyi Lab, Alibaba Group]
25
+
26
+
27
+
28
+ <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
29
+ <tr>
30
+ <td>
31
+ <video src="https://github.com/user-attachments/assets/f052b880-28b5-4a59-8100-77318a9e8425" width="320" controls loop></video>
32
+ </td>
33
+ <td>
34
+ <video src="https://github.com/user-attachments/assets/b698d2e7-4c90-4e95-b24f-38b53514470b" width="320" controls loop></video>
35
+ </td>
36
+ </tr>
37
+ <tr>
38
+ <td>
39
+ <video src="https://github.com/user-attachments/assets/58f4a67f-8f1f-401c-90e2-50479bf81dfb" width="320" controls loop></video>
40
+ </td>
41
+ <td>
42
+ <video src="https://github.com/user-attachments/assets/894fe221-fb09-4422-aa8f-46ce31edf1b4" width="320" controls loop></video>
43
+ </td>
44
+ </tr>
45
+ </table>
46
+
47
+ <p style="text-align: justify;">
48
+ <span>Portrait animations generated by FlashPortrait, showing its power to synthesize <b>infinite-length ID-preserving animations</b>. All videos are <b>directly synthesized by FlashPortrait without the use of any face-related post-processing tools</b>, such as the face-swapping tool FaceFusion or face restoration models like GFP-GAN and CodeFormer.</span>
49
+ </p>
50
+
51
+
52
+ <p align="center">
53
+ <video src="https://github.com/user-attachments/assets/20f34576-0689-4be2-99b1-aee550f07641" width="768" autoplay loop muted playsinline></video>
54
+ <video src="https://github.com/user-attachments/assets/fb63eb2d-c8bb-49a4-bac1-ab8ef2c96841" width="768" autoplay loop muted playsinline></video>
55
+ <br/>
56
+ <span>Comparison results between FlashPortrait and state-of-the-art (SOTA) portrait animation models highlight the superior performance of FlashPortrait in delivering <b>infinite-length, high-fidelity, identity-preserving portrait animation</b>.</span>
57
+ </p>
58
+
59
+
60
+ ## Overview
61
+
62
+ <p align="center">
63
+ <img src="assets/figures/framework.jpg" alt="model architecture" width="1280"/>
64
+ <br/>
65
+ <i>Overview of the FlashPortrait framework.</i>
66
+ </p>
67
+
68
+ Current diffusion-based acceleration methods for long-portrait animation struggle to ensure identity (ID) consistency. This paper presents FlashPortrait, an end-to-end video diffusion transformer capable of synthesizing ID-preserving, infinite-length videos while achieving up to 6$\times$ acceleration in inference speed.
69
+ In particular, FlashPortrait begins by computing the identity-agnostic facial expression features with an off-the-shelf extractor.
70
+ It then introduces a Normalized Facial Expression Block to align facial features with diffusion latents by normalizing them with their respective means and variances, thereby improving identity stability in facial modeling.
71
+ During inference, FlashPortrait adopts a dynamic sliding-window scheme with weighted blending in overlapping areas, ensuring smooth transitions and ID consistency in long animations. In each context window, based on the latent variation rate at particular timesteps and the derivative magnitude ratio among diffusion layers, FlashPortrait
72
+ utilizes higher-order latent derivatives at the current timestep to directly predict latents at future timesteps, thereby skipping several denoising steps and achieving 6$\times$ speed acceleration.
73
+ Experiments on benchmarks show the effectiveness of FlashPortrait both qualitatively and quantitatively.
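
The latent-prediction step described above can be illustrated with a small sketch. This is not the Adaptive Latent Prediction code (which the News section below lists as not yet released); it assumes latents are plain lists of floats sampled at equally spaced timesteps, approximates derivatives with backward finite differences, and extrapolates with a truncated Taylor series. The names `finite_differences` and `predict_future_latent` are hypothetical.

```python
from math import factorial


def finite_differences(history, order):
    """Backward finite differences at the most recent latent.

    history: list of latents (each a list of floats), oldest first.
    Returns [d0, d1, ..., d_order], where d0 is the latest latent and
    dk approximates the k-th derivative scaled by step**k.
    """
    if len(history) < order + 1:
        raise ValueError("need at least order + 1 past latents")
    level = [list(x) for x in history[-(order + 1):]]
    diffs = [level[-1]]
    for _ in range(order):
        # Difference consecutive entries; the last one is the backward difference.
        level = [[b - a for a, b in zip(prev, nxt)]
                 for prev, nxt in zip(level, level[1:])]
        diffs.append(level[-1])
    return diffs


def predict_future_latent(history, steps_ahead, order=2):
    """Truncated Taylor extrapolation `steps_ahead` steps past the latest latent."""
    diffs = finite_differences(history, order)
    predicted = [0.0] * len(diffs[0])
    for k, d in enumerate(diffs):
        coeff = steps_ahead ** k / factorial(k)
        predicted = [p + coeff * v for p, v in zip(predicted, d)]
    return predicted
```

Each predicted latent stands in for a denoising step that would otherwise require a full transformer forward pass. The extrapolation is exact only for linearly evolving latents and approximate otherwise, which is presumably why the method decides adaptively when to predict.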
74
+
75
+ ## News
76
+ * `[2025-12-15]`:🔥 The project page, code, technical report, and [a basic model checkpoint](https://huggingface.co/FrancisRing/FlashPortrait/tree/main) are released. The further acceleration component (Adaptive Latent Prediction) will be released very soon. Stay tuned!
77
+
78
+ ## 🛠️ To-Do List
79
+ - [x] FlashPortrait-14B
80
+ - [x] Inference Code
81
+ - [x] Training Code
82
+ - [ ] Multiple-GPU Inference Code
83
+ - [ ] Inference Code with Adaptive Latent Prediction
84
+
85
+ ## 🔑 Quickstart
86
+
87
+ FlashPortrait supports generating <b>infinite-length videos at 480x832, 832x480, 512x512, 720x720, 720x1280, or 1280x720 resolution</b>. If you run out of GPU memory, reduce the number of animated frames or the output resolution.
88
+
89
+ ### 🧱 Environment setup
90
+
91
+ ```
92
+ pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
93
+ pip install -r requirements.txt
94
+ # Optional: install flash_attn to accelerate attention computation
95
+ pip install flash_attn
96
+ ```
97
+
98
+ ### 🧱 Download weights
99
+ If you encounter connection issues with Hugging Face, you can utilize the mirror endpoint by setting the environment variable: `export HF_ENDPOINT=https://hf-mirror.com`.
100
+ Please download weights manually as follows:
101
+ ```
102
+ pip install "huggingface_hub[cli]"
103
+ cd FlashPortrait
104
+ mkdir checkpoints
105
+ huggingface-cli download FrancisRing/FlashPortrait --local-dir ./checkpoints/FlashPortrait
106
+ huggingface-cli download Wan-AI/Wan2.1-I2V-14B-720P --local-dir ./checkpoints/Wan2.1-I2V-14B-720P
107
+ ```
108
+ The weights and the overall file structure of this project should be organized as follows:
110
+ ```
111
+ FlashPortrait/
112
+ ├── config
113
+ ├── examples
114
+ ├── wan
115
+ ├── checkpoints
116
+ │   ├── FlashPortrait
117
+ │   └── Wan2.1-I2V-14B-720P
118
+ ├── infer.py
119
+ ├── fast_infer.py
120
+ ├── train_portrait.py
121
+ ├── bin_convert_pt.py
122
+ ├── train_single_machine.sh
123
+ ├── train_multiple_machine.sh
124
+ └── requirements.txt
125
+ ```
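Before running inference, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the `EXPECTED` list and the `missing_entries` helper are hypothetical, and only the entries shown in the tree above are checked.

```python
from pathlib import Path

# Entries taken from the directory tree above; extend as needed.
EXPECTED = [
    "checkpoints/FlashPortrait",
    "checkpoints/Wan2.1-I2V-14B-720P",
    "infer.py",
    "fast_infer.py",
]


def missing_entries(project_root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries(".")
    if missing:
        print("Missing entries:", ", ".join(missing))
    else:
        print("Project layout looks complete.")
```

Run it from the `FlashPortrait/` directory; an empty result means both checkpoint folders and the inference scripts were found.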
126
+
127
+ ### 🧱 Model inference
128
+ A sample configuration for testing is provided as `infer.py` and `fast_infer.py`. You can also easily modify the various configurations according to your needs.
129
+
130
+ ```
131
+ bash inference.sh
132
+ ```
133
+ Wan2.1-14B-based FlashPortrait supports video-driven portrait video generation at various resolution settings: 512x512, 480x832, 832x480, 720x720, 720x1280, and 1280x720. You can modify "max_size" in `infer.py` to set the resolution of the animation. "--validation_image_start", "--validation_driven_video_path", and "--prompt" in `infer.py` refer to the path of the given reference image, the path of the driving video, and the text prompt, respectively.
134
+ Prompts are also very important. The recommended prompt format is `[Description of first frame]-[Description of human behavior]-[Description of background (optional)]`.
135
+ "--wan_model_name", "--transformer_path", and "--portrait_encoder_path" in `infer.py` are the paths of pretrained Wan2.1-14B weights, pretrained FlashPortrait DiT weights, and pretrained FlashPortrait Portrait Encoder weights, respectively.
136
+ "--num_inference_steps", "--sub_num_frames", "--latents_num_frames", "--context_overlap", and "--context_size" refer to the total number of inference steps, the number of RGB frames synthesized in a batch, the number of latent frames synthesized in a batch, the overlap length between two adjacent context windows, and the number of latent frames synthesized in a context window, respectively.
137
+ Notably, the recommended `--num_inference_steps` range is [30-50]; more steps bring higher quality. The recommended `--context_overlap` range is [10-40]; a longer overlap yields higher quality but slower inference.
138
+ "--text_cfg_scale" and "--emo_cfg_scale" are the classifier-free guidance (CFG) scales of the text prompt and the portrait emotion. The recommended range for both the text and emotion CFG scales is `[2-5]`. You can increase the emotion CFG scale to improve emotion synchronization with the driving video.
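
To make the roles of `--context_size` and `--context_overlap` concrete, here is a minimal sketch of a sliding-window schedule with weighted blending in the overlapping frames. It is an illustration under stated assumptions, not the code in `infer.py`: the window stride (`context_size - context_overlap`), the linear ramp weights, the use of one scalar per latent frame, and the function names `context_windows` and `blend_windows` are all hypothetical.

```python
def context_windows(num_latent_frames, context_size, context_overlap):
    """Return (start, end) index pairs covering all latent frames."""
    if not 0 <= context_overlap < context_size:
        raise ValueError("context_overlap must be in [0, context_size)")
    if num_latent_frames <= context_size:
        return [(0, num_latent_frames)]
    stride = context_size - context_overlap
    windows = []
    start = 0
    while start + context_size < num_latent_frames:
        windows.append((start, start + context_size))
        start += stride
    # Shift the final window back so it ends exactly at the last frame.
    windows.append((num_latent_frames - context_size, num_latent_frames))
    return windows


def blend_windows(window_values, windows, num_latent_frames, context_overlap):
    """Weighted average of per-window predictions (one float per frame)."""
    total = [0.0] * num_latent_frames
    weight_sum = [0.0] * num_latent_frames
    for values, (start, end) in zip(window_values, windows):
        length = end - start
        for i in range(length):
            # Weight ramps up linearly from each window edge, flat in the middle.
            ramp = min(i + 1, length - i, context_overlap + 1) / (context_overlap + 1)
            total[start + i] += ramp * values[i]
            weight_sum[start + i] += ramp
    return [t / w for t, w in zip(total, weight_sum)]
```

With this scheme a larger overlap means more windows per video, which is consistent with the quality versus speed trade-off noted above for `--context_overlap`.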
139
+
140
+ We provide 6 cases in different resolution settings in `path/FlashPortrait/examples` for validation. ❤️❤️Please feel free to try it out and enjoy the endless entertainment of infinite-length portrait video generation❤️❤️!
141
+
142
+ #### 💡Tips
143
+ - `fast_infer.py` offers faster inference and uses the same configuration settings as `infer.py`.
144
+
145
+ - If you have limited GPU resources, you can change the loading mode of FlashPortrait by modifying "--GPU_memory_mode" in `infer.py`. The options of "--GPU_memory_mode" are `model_full_load`, `sequential_cpu_offload`, `model_cpu_offload_and_qfloat8`, and `model_cpu_offload`. In particular, when you set `--GPU_memory_mode` to `sequential_cpu_offload`, the total GPU memory consumption is approximately 10GB, at the cost of slower inference.
146
+ Setting `--GPU_memory_mode` to `model_cpu_offload` can significantly cut GPU memory usage, reducing it by roughly half compared to `model_full_load` mode.
147
+
148
+ - A higher resolution setting results in higher-quality synthesized videos (480p->720p).
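
As a rough aid for choosing `--GPU_memory_mode`, the sketch below maps free VRAM to one of the modes listed in the tips. The thresholds are assumptions derived from the approximate figures quoted in this README for a 10s 720x1280 video (about 60GB for `model_full_load`, roughly half of that for `model_cpu_offload`, about 10GB for `sequential_cpu_offload`); the helper name `suggest_gpu_memory_mode` is hypothetical, and `model_cpu_offload_and_qfloat8` is left out because no figure is given for it.

```python
def suggest_gpu_memory_mode(free_vram_gb):
    """Suggest a --GPU_memory_mode value from free VRAM in GB (heuristic)."""
    if free_vram_gb >= 60:
        return "model_full_load"
    if free_vram_gb >= 30:
        return "model_cpu_offload"
    # Lowest-memory option listed; expect slower inference.
    return "sequential_cpu_offload"


if __name__ == "__main__":
    for vram in (80, 40, 12):
        print(vram, "GB ->", suggest_gpu_memory_mode(vram))
```

Shorter or lower-resolution videos need less memory, so treat the thresholds as conservative starting points.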
149
+
150
+ ### 🧱 Model Training
151
+ <b>🔥🔥It’s worth noting that if you’re looking to train a conditioned Video Diffusion Transformer (DiT) model, such as Wan2.1, this training tutorial will also be helpful.🔥🔥</b>
152
+ The training dataset must be organized as follows:
153
+
154
+ ```
155
+ poirtrait_data/
156
+ ├── rec
157
+ │   │  ├──speech
158
+ │   │  │  ├──00001
159
+ │   │  │  │  ├──images
160
+ │   │  │  │  │  ├──frame_0.png
161
+ │   │  │  │  │  ├──frame_1.png
162
+ │   │  │  │  │  ├──frame_2.png
163
+ │   │  │  │  │  ├──...
164
+ │   │  │  │  ├──face_masks
165
+ │   │  │  │  │  ├──frame_0.png
166
+ │   │  │  │  │  ├──frame_1.png
167
+ │   │  │  │  │  ├──frame_2.png
168
+ │   │  │  │  │  ├──...
169
+ │   │  │  │  ├──lip_masks
170
+ │   │  │  │  │  ├──frame_0.png
171
+ │   │  │  │  │  ├──frame_1.png
172
+ │   │  │  │  │  ├──frame_2.png
173
+ │   │  │  │  │  ├──...
174
+ │   │  │  ├──00002
175
+ │   │  │  │  ├──images
176
+ │   │  │  │  ├──face_masks
177
+ │   │  │  │  ├──lip_masks
178
+ │   │  │  └──...
179
+ │   │  ├──singing
180
+ │   │  │  ├──00001
181
+ │   │  │  │  ├──images
182
+ │   │  │  │  ├──face_masks
183
+ │   │  │  │  ├──lip_masks
184
+ │   │  │  └──...
185
+ │   │  ├──dancing
186
+ │   │  │  ├──00001
187
+ │   │  │  │  ├──images
188
+ │   │  │  │  ├──face_masks
189
+ │   │  │  │  ├──lip_masks
190
+ │   │  │  └──...
191
+ ├── vec
192
+ │   │  ├──speech
193
+ │   │  │  ├──00001
194
+ │   │  │  │  ├──images
195
+ │   │  │  │  ├──face_masks
196
+ │   │  │  │  ├──lip_masks
197
+ │   │  │  └──...
198
+ │   │  ├──singing
199
+ │   │  │  ├──00001
200
+ │   │  │  │  ├──images
201
+ │   │  │  │  ├──face_masks
202
+ │   │  │  │  ├──lip_masks
203
+ │   │  │  └──...
204
+ │   │  ├──dancing
205
+ │   │  │  ├──00001
206
+ │   │  │  │  ├──images
207
+ │   │  │  │  ├──face_masks
208
+ │   │  │  │  ├──lip_masks
209
+ │   │  │  └──...
210
+ ├── square
211
+ │   │  ├──speech
212
+ │   │  │  ├──00001
213
+ │   │  │  │  ├──images
214
+ │   │  │  │  ├──face_masks
215
+ │   │  │  │  ├──lip_masks
216
+ │   │  │  └──...
217
+ │   │  ├──singing
218
+ │   │  │  ├──00001
219
+ │   │  │  │  ├──images
220
+ │   │  │  │  ├──face_masks
221
+ │   │  │  │  ├──lip_masks
222
+ │   │  │  └──...
223
+ │   │  ├──dancing
224
+ │   │  │  ├──00001
225
+ │   │  │  │  ├──images
226
+ │   │  │  │  ├──face_masks
227
+ │   │  │  │  ├──lip_masks
228
+ │   │  │  └──...
229
+ ├── video_rec_path.txt
230
+ ├── video_square_path.txt
231
+ └── video_vec_path.txt
232
+ ```
233
+ FlashPortrait is trained on mixed-resolution videos, with 720x720 videos stored in `poirtrait_data/square`, 480x832 videos stored in `poirtrait_data/vec`, and 832x480 videos stored in `poirtrait_data/rec`. Each of `poirtrait_data/square`, `poirtrait_data/rec`, and `poirtrait_data/vec` contains three subfolders, one per video type (speech, singing, and dancing).
234
+ All `.png` image files are named in the format `frame_i.png`, such as `frame_0.png`, `frame_1.png`, and so on.
235
+ `00001`, `00002`, `00003`, and so on are folders that each hold the data of one individual video.
236
+ Within each video folder, the three subfolders `images`, `face_masks`, and `lip_masks` store the RGB frames, the corresponding human face masks, and the corresponding human lip masks, respectively.
237
+ `video_square_path.txt`, `video_rec_path.txt`, and `video_vec_path.txt` list the video folder paths under `poirtrait_data/square`, `poirtrait_data/rec`, and `poirtrait_data/vec`, respectively.
238
+ For example, the content of `video_rec_path.txt` is shown as follows:
239
+ ```
240
+ path/FlashPortrait/poirtrait_data/rec/speech/00001
241
+ path/FlashPortrait/poirtrait_data/rec/speech/00002
242
+ ...
243
+ path/FlashPortrait/poirtrait_data/rec/singing/00003
244
+ path/FlashPortrait/poirtrait_data/rec/singing/00004
245
+ ...
246
+ path/FlashPortrait/poirtrait_data/rec/dancing/00005
247
+ path/FlashPortrait/poirtrait_data/rec/dancing/00006
248
+ ...
249
+ ```
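The three path files can be generated rather than written by hand. The sketch below walks the dataset layout described above and writes one video folder path per line. It is not part of the repository: the helper name `write_path_lists` is hypothetical, and it assumes absolute paths are acceptable to the training scripts and that a video folder is any directory containing an `images` subfolder.

```python
from pathlib import Path

SPLITS = ("rec", "vec", "square")
CATEGORIES = ("speech", "singing", "dancing")


def write_path_lists(data_root):
    """Write video_<split>_path.txt files under data_root; return counts per split."""
    root = Path(data_root).resolve()
    counts = {}
    for split in SPLITS:
        video_dirs = []
        for category in CATEGORIES:
            category_dir = root / split / category
            if category_dir.is_dir():
                # Keep only folders that actually contain extracted frames.
                video_dirs += sorted(
                    p for p in category_dir.iterdir() if (p / "images").is_dir()
                )
        list_file = root / f"video_{split}_path.txt"
        list_file.write_text("".join(f"{p}\n" for p in video_dirs))
        counts[split] = len(video_dirs)
    return counts


if __name__ == "__main__":
    print(write_path_lists("poirtrait_data"))
```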
250
+ If you only have raw videos, you can leverage `ffmpeg` to extract frames from raw videos (speech) and store them in the subfolder `images`.
251
+ ```
252
+ ffmpeg -i raw_video_1.mp4 -q:v 1 -start_number 0 path/FlashPortrait/poirtrait_data/rec/speech/00001/images/frame_%d.png
253
+ ```
254
+ The obtained frames are saved in `path/FlashPortrait/poirtrait_data/rec/speech/00001/images`.
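
For more than a handful of videos, the same `ffmpeg` call can be wrapped in a small script. This is a sketch, not repository code: the helper names `build_ffmpeg_command` and `extract_frames` are hypothetical, and it assumes `ffmpeg` is available on the `PATH`. The flags mirror the command shown above.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, images_dir):
    """Build the frame-extraction command used above as an argument list."""
    return [
        "ffmpeg", "-i", str(video_path),
        "-q:v", "1", "-start_number", "0",
        str(Path(images_dir) / "frame_%d.png"),
    ]


def extract_frames(video_path, images_dir):
    """Create images_dir if needed and extract frames as frame_0.png, frame_1.png, ..."""
    Path(images_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_command(video_path, images_dir), check=True)


if __name__ == "__main__":
    print(" ".join(build_ffmpeg_command(
        "raw_video_1.mp4",
        "path/FlashPortrait/poirtrait_data/rec/speech/00001/images",
    )))
```

Calling `extract_frames` once per raw video, with a new numbered folder each time, produces the `images` subfolders expected by the dataset layout.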
255
+
256
+ For extracting the human face masks, please refer to the [StableAnimator repo](https://github.com/Francis-Rings/StableAnimator). The Human Face Mask Extraction section in its tutorial provides off-the-shelf code.
257
+ For extracting the human lip masks, please refer to the [StableAvatar repo](https://github.com/Francis-Rings/StableAvatar). The Human Lip Mask Extraction section in its tutorial provides off-the-shelf code.
258
+
259
+ When your dataset is organized exactly as outlined above, you can easily train your Wan2.1-14B-based FlashPortrait by running the following command:
260
+ ```
261
+ # Training FlashPortrait on a mixed resolution setting (480x832, 832x480, and 720x720) in a single node
262
+ bash train_single_machine.sh
263
+ # Training FlashPortrait on a mixed resolution setting (480x832, 832x480, and 720x720) in multiple nodes
264
+ bash train_multiple_machine.sh
265
+ ```
266
+ For the parameter details of `train_single_machine.sh` and `train_multiple_machine.sh`, `CUDA_VISIBLE_DEVICES` refers to the GPU devices. In my setting, I use 4 NVIDIA A100 80GB GPUs to train FlashPortrait (`CUDA_VISIBLE_DEVICES=3,2,1,0`) in a single node.
267
+ `--pretrained_model_name_or_path` and `--output_dir` refer to the pretrained Wan2.1-14B path and the checkpoint saved path of the trained FlashPortrait.
268
+ `--train_data_square_dir`, `--train_data_rec_dir`, and `--train_data_vec_dir` are the paths of `video_square_path.txt`, `video_rec_path.txt`, and `video_vec_path.txt`, respectively.
269
+ `--video_sample_n_frames` is the number of frames that FlashPortrait processes in a single batch.
270
+ `--num_train_epochs` is the training epoch number.
271
+
272
+ Since we utilize DeepSpeed-Stage-3 to train our FlashPortrait, we need to convert the saved checkpoint to fp32 as follows:
273
+ ```
274
+ cd output_14B_dir/checkpoint-x
275
+ python zero_to_fp32.py /path/FlashPortrait/output_14B_dir/checkpoint-x /path/FlashPortrait/output_14B_dir/checkpoint-x-fp32-infer --max_shard_size 80GB
276
+ cd ../..
277
+ python bin_convert_pt.py --pretrained_model_path="/path/FlashPortrait/output_14B_dir/checkpoint-x-fp32-infer"
278
+ ```
279
+ <b>It is worth noting that training FlashPortrait requires approximately 50GB of VRAM due to the mixed-resolution (480x832, 832x480, and 720x720) training pipeline.
280
+ However, if you train FlashPortrait exclusively on 512x512 videos, the VRAM requirement is reduced to approximately 40GB.</b>
281
+ Additionally, the backgrounds of the selected training videos should remain static, as this helps the diffusion model calculate an accurate reconstruction loss.
282
+
283
+
284
+ ### 🧱 Model Finetuning
285
+ To fully finetune FlashPortrait, add `--transformer_path="path/FlashPortrait/checkpoints/FlashPortrait/transformer.pt"` and `--portrait_encoder_path="path/FlashPortrait/checkpoints/FlashPortrait/portrait_encoder.pt"` to `train_single_machine.sh` or `train_multiple_machine.sh`, then run:
286
+ ```
287
+ # Finetuning FlashPortrait on a mixed resolution setting (480x832, 832x480, and 720x720) in a single node
288
+ bash train_single_machine.sh
289
+ # Finetuning FlashPortrait on a mixed resolution setting (480x832, 832x480, and 720x720) in multiple nodes
290
+ bash train_multiple_machine.sh
291
+ ```
292
+
293
+ ### 🧱 VRAM requirement
294
+
295
+ For a 10s video (720x1280, fps=25), FlashPortrait (--GPU_memory_mode="model_full_load") requires approximately 60GB of VRAM on an A100 GPU (--GPU_memory_mode="sequential_cpu_offload" requires approximately 10GB of VRAM).
296
+
297
+ <b>🔥🔥Theoretically, FlashPortrait is capable of synthesizing hours of video without significant quality degradation; however, the 3D VAE decoder demands significant GPU memory, especially when decoding 10k+ frames. You have the option to run the VAE on CPU.🔥🔥</b>
298
+
299
+ ### 🧱 Acknowledgments
300
+ Thanks to [Wan2.1](https://github.com/Wan-Video/Wan2.1), [PD-FGC](https://github.com/Dorniwang/PD-FGC-inference), [FantasyPortrait](https://github.com/Fantasy-AMAP/fantasy-portrait) and [VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun) for open-sourcing their models and code, which provided valuable references and support for this project. Their contributions to the open-source community are truly appreciated.
301
+
302
+ ## Contact
303
+ If you have any suggestions or find our work helpful, feel free to contact me.
304
+
305
+ Email: francisshuyuan@gmail.com
306
+
307
+ If you find our work useful, <b>please consider giving a star ⭐ to this GitHub repository and citing it ❤️</b>:
308
+ ```bib
309
+ @article{tu2025flashportrait,
310
+ title={FlashPortrait: 6$\times$ Faster Infinite Portrait Animation with Adaptive Latent Prediction},
311
+ author={Tu, Shuyuan and Pan, Yueming and Huang, Yinming and Han, Xintong and Xing, Zhen and Dai, Qi and Qiu, Kai and Luo, Chong and Wu, Zuxuan},
312
+ journal={arXiv preprint arXiv:},
313
+ year={2025}
314
+ }
315
+ ```