---
license: apache-2.0
---
<h1 align='center'>EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation</h1>
<div align='center'>
<a href='https://github.com/mengrang' target='_blank'>Rang Meng</a><sup>1</sup> 
<a href='https://github.com/' target='_blank'>Yan Wang</a> 
<a href='https://github.com/' target='_blank'>Weipeng Wu</a> 
<a href='https://github.com/' target='_blank'>Ruobing Zheng</a> 
<a href='https://lymhust.github.io/' target='_blank'>Yuming Li</a><sup>2</sup> 
<a href='https://openreview.net/profile?id=~Chenguang_Ma3' target='_blank'>Chenguang Ma</a><sup>2</sup>
</div>
<div align='center'>
Terminal Technology Department, Alipay, Ant Group.
</div>
<p align='center'>
<sup>1</sup>Core Contributor 
<sup>2</sup>Corresponding Authors
</p>
<div align='center'>
<a href='https://antgroup.github.io/ai/echomimic_v3/'><img src='https://img.shields.io/badge/Project-Page-blue'></a>
<a href='https://arxiv.org/abs/2507.03905'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>
</div>
## 🚀 EchoMimic Series
* EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation. [GitHub](https://github.com/antgroup/echomimic_v3)
* EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation. [GitHub](https://github.com/antgroup/echomimic_v2)
* EchoMimicV1: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditioning. [GitHub](https://github.com/antgroup/echomimic)
## 📣 Updates
* [2025.07.08] 🔥 Our [paper](https://arxiv.org/abs/2507.03905) is now public on arXiv.
## 🌅 Gallery
<p align="center">
<img src="asset/echomimicv3.jpg" height=700>
</p>
<table class="center">
<tr>
<td width=100% style="border: none">
<video controls loop src="https://github.com/user-attachments/assets/f33edb30-66b1-484b-8be0-a5df20a44f3b" muted="false"></video>
</td>
</tr>
</table>
For more demo videos, please refer to the project page.
## Quick Start
### Environment Setup
- Tested system environments: CentOS 7.2 / Ubuntu 22.04, CUDA >= 12.1
- Tested GPUs: A100 (80G) / RTX 4090D (24G) / V100 (16G)
- Tested Python versions: 3.10 / 3.11
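A quick way to check a machine against these requirements before installing:
```
# GPU visibility and driver info
nvidia-smi
# CUDA toolkit version (CUDA >= 12.1 expected)
nvcc --version
# Python version (3.10 / 3.11 tested)
python --version
```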
### 🛠️ Installation
#### 1. Create a conda environment and install PyTorch, xformers
```
conda create -n echomimic_v3 python=3.10
conda activate echomimic_v3
# Install PyTorch and xformers; pick builds matching your CUDA version (an assumption, adjust as needed)
pip install torch torchvision torchaudio xformers
```
#### 2. Other dependencies
```
pip install -r requirements.txt
```
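To confirm the setup, a quick sanity check (assuming PyTorch was installed in step 1) verifies that a CUDA-enabled build is visible:
```
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```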
### 🧱 Model Preparation
| Models              | Download Link                                                                   | Notes         |
|---------------------|---------------------------------------------------------------------------------|---------------|
| Wan2.1-Fun-1.3B-InP | 🤗 [Huggingface](https://huggingface.co/spaces/alibaba-pai/Wan2.1-Fun-1.3B-InP) | Base model    |
| wav2vec2-base       | 🤗 [Huggingface](https://huggingface.co/facebook/wav2vec2-base-960h)            | Audio encoder |
| EchoMimicV3         | 🤗 [Huggingface](https://huggingface.co/BadToBest/EchoMimicV3)                  | Our weights   |
The **weights** are organized as follows:
```
./models/
├── Wan2.1-Fun-1.3B-InP
├── wav2vec2-base-960h
└── transformer
    └── diffusion_pytorch_model.safetensors
```
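One way to fetch the weights is the `huggingface-cli download` command from the `huggingface_hub` package. A minimal sketch, assuming the repo ids from the table above and the layout just shown (the Wan2.1-Fun model id and the `transformer` target folder are assumptions; adjust to match the actual repos):
```
pip install "huggingface_hub[cli]"
# Repo ids and target folders are assumptions based on the table and layout above.
huggingface-cli download alibaba-pai/Wan2.1-Fun-1.3B-InP --local-dir ./models/Wan2.1-Fun-1.3B-InP
huggingface-cli download facebook/wav2vec2-base-960h --local-dir ./models/wav2vec2-base-960h
huggingface-cli download BadToBest/EchoMimicV3 --local-dir ./models/transformer
```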
### 🔑 Quick Inference
```
python infer.py
```
> Tips
> - Audio CFG: Audio CFG works optimally in the 2~3 range. Increase the value for better lip synchronization; decrease it for better visual quality.
> - Text CFG: Text CFG works optimally in the 4~6 range. Increase the value for better prompt following; decrease it for better visual quality.
> - TeaCache: The optimal range for `--teacache_thresh` is 0~0.1.
> - Sampling steps: 5 steps for a talking head, 15~25 steps for a talking body.
> - ‼️ Long video generation: to generate a video longer than 138 frames, use Long Video CFG.
>
> A hypothetical invocation combining these settings is sketched below.
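Apart from `--teacache_thresh`, the flag names in this sketch are illustrative placeholders rather than the script's documented arguments; check `python infer.py --help` or the source of `infer.py` for the real interface.
```
# Hypothetical flag names (only --teacache_thresh appears in the tips above).
python infer.py \
    --audio_guidance_scale 2.5 \
    --guidance_scale 4.5 \
    --teacache_thresh 0.08 \
    --num_inference_steps 5
```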
## 📝 TODO List
| Status | Milestone |
|:--------:|:-------------------------------------------------------------------------|
| 2025.08.08 | The inference code of EchoMimicV3 released on GitHub |
| 🚀 | Preview-version pretrained models (English and Chinese) on HuggingFace |
| 🚀 | Preview-version pretrained models (English and Chinese) on ModelScope |
| 🚀 | 720P pretrained models (English and Chinese) on HuggingFace |
| 🚀 | 720P pretrained models (English and Chinese) on ModelScope |
| 🚀 | The training code of EchoMimicV3 released on GitHub |
## 📒 Citation
If you find our work useful for your research, please consider citing the paper:
```
@misc{meng2025echomimicv3,
      title={EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation},
      author={Rang Meng and Yan Wang and Weipeng Wu and Ruobing Zheng and Yuming Li and Chenguang Ma},
      year={2025},
      eprint={2507.03905},
      archivePrefix={arXiv}
}
```
## 🌟 Star History
[![Star History Chart](https://api.star-history.com/svg?repos=antgroup/echomimic_v3&type=Date)](https://star-history.com/#antgroup/echomimic_v3&Date)