---
license: apache-2.0
sdk: streamlit
---
<div align="center">

<p align="center">
  <img src="assets/logo2.jpg" alt="InfiniteTalk" width="440"/>
</p>

<h1>InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing</h1>

[Shaoshu Yang*](https://scholar.google.com/citations?user=JrdZbTsAAAAJ&hl=en) · [Zhe Kong*](https://scholar.google.com/citations?user=4X3yLwsAAAAJ&hl=zh-CN) · [Feng Gao*](https://scholar.google.com/citations?user=lFkCeoYAAAAJ) · [Meng Cheng*]() · [Xiangyu Liu*]() · [Yong Zhang](https://yzhang2016.github.io/)<sup>&#9993;</sup> · [Zhuoliang Kang](https://scholar.google.com/citations?user=W1ZXjMkAAAAJ&hl=en)

[Wenhan Luo](https://whluo.github.io/) · [Xunliang Cai](https://openreview.net/profile?id=~Xunliang_Cai1) · [Ran He](https://scholar.google.com/citations?user=ayrg9AUAAAAJ&hl=en) · [Xiaoming Wei](https://scholar.google.com/citations?user=JXV5yrZxj5MC&hl=zh-CN)

<sup>*</sup>Equal Contribution
<sup>&#9993;</sup>Corresponding Authors

<a href='https://meigen-ai.github.io/InfiniteTalk/'><img src='https://img.shields.io/badge/Project-Page-green'></a>
<a href='https://arxiv.org/abs/2508.14033'><img src='https://img.shields.io/badge/Technique-Report-red'></a>
<a href='https://huggingface.co/MeiGen-AI/InfiniteTalk'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue'></a>
</div>

> **TL;DR:** InfiniteTalk is an unlimited-length talking-video generation model that supports both audio-driven video-to-video and image-to-video generation.

<p align="center">
  <img src="assets/pipeline.png">
</p>

## 🔥 Latest News

* August 19, 2025: We release the [technique report](https://arxiv.org/abs/2508.14033), weights, and code of **InfiniteTalk**. The Gradio demo and the [ComfyUI](https://github.com/MeiGen-AI/InfiniteTalk/tree/comfyui) branch have been released.
* August 19, 2025: We release the [project page](https://meigen-ai.github.io/InfiniteTalk/) of **InfiniteTalk**.

## Key Features

We propose **InfiniteTalk**, a novel sparse-frame video dubbing framework. Given an input video and an audio track, InfiniteTalk synthesizes a new video with accurate lip synchronization while simultaneously aligning head movements, body posture, and facial expressions with the audio. Unlike traditional dubbing methods that focus solely on the lips, InfiniteTalk enables infinite-length video generation with accurate lip synchronization and consistent identity preservation. Besides dubbing, InfiniteTalk can also be used as an image-audio-to-video model, taking a single image and an audio track as input.
- 💬 Sparse-frame Video Dubbing – Synchronizes not only the lips but also the head, body, and expressions
- ⏱️ Infinite-Length Generation – Supports unlimited video duration
- ✨ Stability – Reduces hand/body distortions compared to MultiTalk
- 🚀 Lip Accuracy – Achieves superior lip synchronization to MultiTalk

## 🌐 Community Works
- [Wan2GP](https://github.com/deepbeepmeep/Wan2GP/): Thanks to [deepbeepmeep](https://github.com/deepbeepmeep) for integrating InfiniteTalk into Wan2GP, which is optimized for low VRAM and offers many video editing options and other models (MMAudio support, Qwen Image Edit, ...).
- [ComfyUI](https://github.com/kijai/ComfyUI-WanVideoWrapper): Thanks to [kijai](https://github.com/kijai) for the ComfyUI support.

## 📑 Todo List

- [x] Release the technical report
- [x] Inference
- [x] Checkpoints
- [x] Multi-GPU Inference
- [ ] Inference acceleration
  - [x] TeaCache
  - [x] int8 quantization
  - [ ] LCM distillation
  - [ ] Sparse Attention
- [x] Run with very low VRAM
- [x] Gradio demo
- [x] ComfyUI

## Video Demos

### Video-to-video (HQ videos can be found on [Google Drive](https://drive.google.com/drive/folders/1BNrH6GJZ2Wt5gBuNLmfXZ6kpqb9xFPjU?usp=sharing))

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
  <tr>
    <td>
      <video src="https://github.com/user-attachments/assets/04f15986-8de7-4bb4-8cde-7f7f38244f9f" width="320" controls loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/1500f72e-a096-42e5-8b44-f887fa8ae7cb" width="320" controls loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/28f484c2-87dc-4828-a9e7-cb963da92d14" width="320" controls loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/665fabe4-3e24-4008-a0a2-a66e2e57c38b" width="320" controls loop></video>
    </td>
  </tr>
</table>

### Image-to-video

<table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
  <tr>
    <td>
      <video src="https://github.com/user-attachments/assets/7e4a4dad-9666-4896-8684-2acb36aead59" width="320" controls loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/bd6da665-f34d-4634-ae94-b4978f92ad3a" width="320" controls loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/510e2648-82db-4648-aaf3-6542303dbe22" width="320" controls loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/27bb087b-866a-4300-8a03-3bbb4ce3ddf9" width="320" controls loop></video>
    </td>
  </tr>
  <tr>
    <td>
      <video src="https://github.com/user-attachments/assets/3263c5e1-9f98-4b9b-8688-b3e497460a76" width="320" controls loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/5ff3607f-90ec-4eee-b964-9d5ee3028005" width="320" controls loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/e504417b-c8c7-4cf0-9afa-da0f3cbf3726" width="320" controls loop></video>
    </td>
    <td>
      <video src="https://github.com/user-attachments/assets/56aac91e-c51f-4d44-b80d-7d115e94ead7" width="320" controls loop></video>
    </td>
  </tr>
</table>

## Quick Start

### 🛠️ Installation

#### 1. Create a conda environment and install PyTorch and xformers
```
conda create -n multitalk python=3.10
conda activate multitalk
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install -U xformers==0.0.28 --index-url https://download.pytorch.org/whl/cu121
```
#### 2. Install flash-attn
```
pip install misaki[en]
pip install ninja
pip install psutil
pip install packaging
pip install wheel
pip install flash_attn==2.7.4.post1
```
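To confirm that the wheel matches your PyTorch/CUDA build, an optional sanity check (a minimal sketch, assuming the installs above succeeded):
```
python -c "import torch, flash_attn; print(torch.__version__, torch.version.cuda, flash_attn.__version__)"
```
If the import fails, flash-attn was most likely built against a different PyTorch/CUDA combination; reinstall it after the PyTorch step above.
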
#### 3. Other dependencies
```
pip install -r requirements.txt
conda install -c conda-forge librosa
```

#### 4. FFmpeg installation
```
conda install -c conda-forge ffmpeg
```
or
```
sudo yum install ffmpeg ffmpeg-devel
```
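Either way, you can verify the installation afterwards with `ffmpeg -version`.
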
### 🧱 Model Preparation

#### 1. Model Download

| Models                | Download Link                                                                   | Notes                       |
|-----------------------|---------------------------------------------------------------------------------|-----------------------------|
| Wan2.1-I2V-14B-480P   | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P)             | Base model                  |
| chinese-wav2vec2-base | 🤗 [Huggingface](https://huggingface.co/TencentGameMate/chinese-wav2vec2-base)  | Audio encoder               |
| MeiGen-InfiniteTalk   | 🤗 [Huggingface](https://huggingface.co/MeiGen-AI/InfiniteTalk)                 | Our audio condition weights |

Download the models using huggingface-cli:
```sh
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./weights/Wan2.1-I2V-14B-480P
huggingface-cli download TencentGameMate/chinese-wav2vec2-base --local-dir ./weights/chinese-wav2vec2-base
huggingface-cli download TencentGameMate/chinese-wav2vec2-base model.safetensors --revision refs/pr/1 --local-dir ./weights/chinese-wav2vec2-base
huggingface-cli download MeiGen-AI/InfiniteTalk --local-dir ./weights/InfiniteTalk
```
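On slow connections, the downloads can optionally be accelerated with the `hf_transfer` backend (a sketch, not required by the repository; the same environment variable applies to each of the commands above):
```sh
pip install hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download MeiGen-AI/InfiniteTalk --local-dir ./weights/InfiniteTalk
```
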
### 🔑 Quick Inference

Our model is compatible with both 480P and 720P resolutions.
> Some tips:
> - Lip synchronization accuracy: audio CFG works optimally between 3 and 5; increase the audio CFG value for better synchronization.
> - FusionX: while it enables faster inference and higher quality, the FusionX LoRA exacerbates color shift beyond 1 minute and reduces identity preservation in videos.
> - V2V generation: enables unlimited-length generation. The model mimics the original video's camera movement, though not identically. Using SDEdit improves camera-movement accuracy significantly but introduces color shift and is best suited for short clips. Improvements for long-video camera control are planned.
> - I2V generation: generates good results from a single image for up to 1 minute. Beyond 1 minute, color shifts become more pronounced. One trick for high-quality generation beyond 1 minute is to turn the image into a video by translating or zooming in on it; see the ffmpeg sketch after these tips, or use this script to [convert an image to a video](https://github.com/MeiGen-AI/InfiniteTalk/blob/main/tools/convert_img_to_video.py).
> - Quantization model: if your inference process is killed due to insufficient memory, we suggest using the quantization model, which can help **reduce memory usage**.

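For the I2V trick above, the repository ships the linked conversion script; if you only have ffmpeg at hand, a rough slow-zoom alternative looks like this (a minimal sketch: the input name, zoom rate, output size, and duration are placeholder choices, not values prescribed by the repository):
```
# 1500 output frames at 25 FPS = a 60-second slow zoom into the image center
ffmpeg -i input.jpg \
    -vf "zoompan=z='min(zoom+0.0005,1.2)':x='iw/2-(iw/zoom/2)':y='ih/2-(ih/zoom/2)':d=1500:s=1280x720:fps=25" \
    -pix_fmt yuv420p input_as_video.mp4
```
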
#### Usage of InfiniteTalk

```
--mode streaming: long video generation.
--mode clip: generate a short video in one chunk.
--use_teacache: run with TeaCache.
--size infinitetalk-480: generate 480P video.
--size infinitetalk-720: generate 720P video.
--use_apg: run with APG.
--teacache_thresh: a coefficient used for TeaCache acceleration.
--sample_text_guide_scale: when not using LoRA, the optimal value is 5; after applying LoRA, the recommended value is 1.
--sample_audio_guide_scale: when not using LoRA, the optimal value is 4; after applying LoRA, the recommended value is 2.
--max_frame_num: the maximum frame length of the generated video; the default is 40 seconds (1000 frames at 25 FPS).
```

#### 1. Inference

##### 1) Run with a single GPU

```
python generate_infinitetalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
    --input_json examples/single_example_image.json \
    --size infinitetalk-480 \
    --sample_steps 40 \
    --mode streaming \
    --motion_frame 9 \
    --save_file infinitetalk_res
```

##### 2) Run at 720P

If you want to run at 720P, set `--size infinitetalk-720`:

```
python generate_infinitetalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
    --input_json examples/single_example_image.json \
    --size infinitetalk-720 \
    --sample_steps 40 \
    --mode streaming \
    --motion_frame 9 \
    --save_file infinitetalk_res_720p
```

##### 3) Run with very low VRAM

If you want to run with very low VRAM, set `--num_persistent_param_in_dit 0`:

```
python generate_infinitetalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
    --input_json examples/single_example_image.json \
    --size infinitetalk-480 \
    --sample_steps 40 \
    --num_persistent_param_in_dit 0 \
    --mode streaming \
    --motion_frame 9 \
    --save_file infinitetalk_res_lowvram
```

##### 4) Multi-GPU inference

```
GPU_NUM=8
torchrun --nproc_per_node=$GPU_NUM --standalone generate_infinitetalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
    --dit_fsdp --t5_fsdp \
    --ulysses_size=$GPU_NUM \
    --input_json examples/single_example_image.json \
    --size infinitetalk-480 \
    --sample_steps 40 \
    --mode streaming \
    --motion_frame 9 \
    --save_file infinitetalk_res_multigpu
```

##### 5) Multi-person animation

```
python generate_infinitetalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/multi/infinitetalk.safetensors \
    --input_json examples/multi_example_image.json \
    --size infinitetalk-480 \
    --sample_steps 40 \
    --num_persistent_param_in_dit 0 \
    --mode streaming \
    --motion_frame 9 \
    --save_file infinitetalk_res_multiperson
```

#### 2. Run with FusioniX or lightx2v (requires only 4–8 steps)

[FusioniX](https://huggingface.co/vrgamedevgirl84/Wan14BT2VFusioniX/blob/main/FusionX_LoRa/Wan2.1_I2V_14B_FusionX_LoRA.safetensors) requires 8 steps, and [lightx2v](https://huggingface.co/Kijai/WanVideo_comfy/blob/main/Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank32.safetensors) requires only 4 steps.

```
python generate_infinitetalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
    --lora_dir weights/Wan2.1_I2V_14B_FusionX_LoRA.safetensors \
    --input_json examples/single_example_image.json \
    --lora_scale 1.0 \
    --size infinitetalk-480 \
    --sample_text_guide_scale 1.0 \
    --sample_audio_guide_scale 2.0 \
    --sample_steps 8 \
    --mode streaming \
    --motion_frame 9 \
    --sample_shift 2 \
    --num_persistent_param_in_dit 0 \
    --save_file infinitetalk_res_lora
```

#### 3. Run with the quantization model (single-GPU only)

```
python generate_infinitetalk.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
    --input_json examples/single_example_image.json \
    --size infinitetalk-480 \
    --sample_steps 40 \
    --mode streaming \
    --quant fp8 \
    --quant_dir weights/InfiniteTalk/quant_models/infinitetalk_single_fp8.safetensors \
    --motion_frame 9 \
    --num_persistent_param_in_dit 0 \
    --save_file infinitetalk_res_quant
```

#### 4. Run with Gradio

For the single-person weights:
```
python app.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/single/infinitetalk.safetensors \
    --num_persistent_param_in_dit 0 \
    --motion_frame 9
```
or, for the multi-person weights:
```
python app.py \
    --ckpt_dir weights/Wan2.1-I2V-14B-480P \
    --wav2vec_dir 'weights/chinese-wav2vec2-base' \
    --infinitetalk_dir weights/InfiniteTalk/multi/infinitetalk.safetensors \
    --num_persistent_param_in_dit 0 \
    --motion_frame 9
```
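Gradio prints a local URL to the console when the app starts (typically `http://127.0.0.1:7860` by default); open it in a browser to use the demo.
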

## 📚 Citation

If you find our work useful in your research, please consider citing:

```
@misc{yang2025infinitetalkaudiodrivenvideogeneration,
      title={InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing},
      author={Shaoshu Yang and Zhe Kong and Feng Gao and Meng Cheng and Xiangyu Liu and Yong Zhang and Zhuoliang Kang and Wenhan Luo and Xunliang Cai and Ran He and Xiaoming Wei},
      year={2025},
      eprint={2508.14033},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.14033},
}
```

## 📜 License
The models in this repository are licensed under the Apache 2.0 License. We claim no rights over your generated content,
granting you the freedom to use it, provided that your usage complies with the provisions of this license.
You are fully accountable for your use of the models, which must not involve sharing any content that violates applicable laws,
causes harm to individuals or groups, disseminates personal information intended for harm, spreads misinformation, or targets vulnerable populations.