|
|
--- |
|
|
license: mit |
|
|
--- |
|
|
<h1 align="center"> |
|
|
ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model |
|
|
</h1> |
|
|
|
|
|
<h5 align="center"> |
|
|
<a href='https://arxiv.org/abs/2502.20323'>Paper</a>  |
|
|
<a href='https://xg-chu.site/project_artalk/'>Project Website</a>  |
|
|
<a href='https://github.com/xg-chu/ARTalk/'>Code</a>  |
|
|
</h5> |
|
|
|
|
|
<h5 align="center"> |
|
|
<a href="https://xg-chu.site">Xuangeng Chu</a><sup>1</sup>  |
|
|
<a href="https://naba89.github.io">Nabarun Goswami</a><sup>1</sup>,</span>  |
|
|
<a href="https://cuiziteng.github.io">Ziteng Cui</a><sup>1</sup>,</span>  |
|
|
<a href="https://openreview.net/profile?id=~Hanqin_Wang1">Hanqin Wang</a><sup>1</sup>,</span>  |
|
|
<a href="https://www.mi.t.u-tokyo.ac.jp/harada/">Tatsuya Harada</a><sup>1,2</sup> |
|
|
<br> |
|
|
<sup>1</sup>The University of Tokyo, |
|
|
<sup>2</sup>RIKEN AIP |
|
|
</h5> |
|
|
|
|
|
|
|
|
<div align="center"> |
|
|
<!-- <div align="center"> |
|
|
<b><img src="./demos/teaser.gif" alt="drawing" width="960"/></b> |
|
|
</div> --> |
|
|
<b> |
|
|
ARTalk generates realistic 3D head motions (lip sync, blinking, expressions, head poses) from audio. |
|
|
</b> |
|
|
<br> |
|
|
🔥 More results can be found in our <a href="https://xg-chu.site/project_artalk/">Project Page</a>. 🔥 |
|
|
</div> |
|
|
|
|
|
<!-- ## TO DO |
|
|
We are now preparing the <b>pre-trained model and quick start materials</b> and will release it within a week. --> |
|
|
|
|
|
## Installation |
|
|
### Clone the project |
|
|
``` |
|
|
git clone --recurse-submodules git@github.com:xg-chu/ARTalk.git |
|
|
cd ARTalk |
|
|
``` |
|
|
|
|
|
### Build environment |
|
|
We will provide a dedicated environment guide as soon as possible.
|
|
|
|
|
For now, please create the environment from GAGAvatar's `environment.yml`, then install Gradio and the other required libraries:
|
|
``` |
|
|
conda env create -f environment.yml |
|
|
conda activate ARTalk |
|
|
``` |
|
|
|
|
|
<details> |
|
|
<summary><span>Install GAGAvatar Module (If you want to use realistic avatars)</span></summary> |
|
|
|
|
|
``` |
|
|
git clone --recurse-submodules git@github.com:xg-chu/diff-gaussian-rasterization.git |
|
|
pip install ./diff-gaussian-rasterization |
|
|
rm -rf ./diff-gaussian-rasterization |
|
|
``` |
|
|
|
|
|
</details> |
|
|
|
|
|
### Prepare resources |
|
|
Prepare resources with: |
|
|
``` |
|
|
bash ./build_resources.sh |
|
|
``` |
|
|
|
|
|
## Quick Start Guide |
|
|
### Using <a href="https://github.com/gradio-app/gradio">Gradio</a> Interface |
|
|
|
|
|
We provide a simple Gradio demo to demonstrate ARTalk's capabilities: |
|
|
``` |
|
|
python inference.py --run_app |
|
|
``` |
|
|
|
|
|
### Command Line Usage |
|
|
|
|
|
ARTalk can be used via command line: |
|
|
``` |
|
|
python inference.py -a your_audio_path --shape_id your_appearance --style_id your_style_motion --clip_length 750
|
|
``` |
|
|
`--shape_id` can be set to `mesh` or to one of the tracked real avatars stored in `tracked.pt`.
|
|
|
|
|
`--style_id` can be set to the name of a `*.pt` file stored in `assets/style_motion`.
|
|
|
|
|
`--clip_length` sets the maximum duration of the rendered video and can be adjusted as needed; longer videos take more time to render.
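As a rough guide for choosing `--clip_length`, a small helper can convert a target duration into a clip length. This sketch assumes a frame rate of 25 fps, inferred from the style-motion files (50 frames per 2 seconds); adjust the rate if your setup differs.

```python
def seconds_to_clip_length(seconds, fps=25):
    """Convert a target video duration (seconds) into a --clip_length value.

    The 25 fps default is an assumption inferred from the style-motion
    files (50 frames per 2 seconds), not a documented constant.
    """
    return int(seconds * fps)

# The example value of 750 above would correspond to a 30-second video.
print(seconds_to_clip_length(30))  # 750
```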
|
|
|
|
|
<details> |
|
|
<summary><span>Track new real head avatar and new style motion</span></summary> |
|
|
|
|
|
The file `tracked.pt` is generated using <a href="https://github.com/xg-chu/GAGAvatar/blob/main/inference.py">`GAGAvatar/inference.py`</a>. Here I've included several examples of tracked avatars for quick testing. |
|
|
|
|
|
The style motion is tracked with the EMICA module in <a href="https://github.com/xg-chu/GAGAvatar_track">`GAGAvatar_track`</a>. Each file contains a `50×106` tensor: `50` consecutive frames spanning 2 seconds, each with `106` values (`100` expression codes plus `6` pose codes, base + jaw). Here I've included several examples of tracked style motion.
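To illustrate the layout only (the real files are PyTorch tensors read with `torch.load`; this plain-Python stand-in is not the actual loader), a style-motion clip can be split into its expression and pose parts like this:

```python
FRAMES, EXPR_DIM, POSE_DIM = 50, 100, 6  # 2 s of frames; 100 expression + 6 pose codes

# Dummy stand-in for one tracked style-motion clip of shape (50, 106).
motion = [[0.0] * (EXPR_DIM + POSE_DIM) for _ in range(FRAMES)]

# Per frame: the first 100 values are expression codes,
# the last 6 are pose codes (base + jaw).
expression = [frame[:EXPR_DIM] for frame in motion]
pose = [frame[EXPR_DIM:] for frame in motion]

print(len(motion), len(expression[0]), len(pose[0]))  # 50 100 6
```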
|
|
</details> |
|
|
|
|
|
## Training |
|
|
|
|
|
This release modifies the VQVAE component relative to the version described in the paper.
|
|
|
|
|
The training code and the paper-version code are still in preparation and will be released later.
|
|
|
|
|
|
|
|
## Acknowledgements |
|
|
|
|
|
We thank <a href="https://www.linkedin.com/in/lars-traaholt-vågnes-432725130/">Lars Traaholt Vågnes</a> and <a href="https://emmanueliarussi.github.io">Emmanuel Iarussi</a> from <a href="https://www.simli.com">Simli</a> for the insightful discussions! 🤗 |
|
|
|
|
|
The ARTalk logo was designed by Caihong Ning. |
|
|
|
|
|
Part of our work builds on FLAME.
|
|
We also thank the following projects for sharing their great work. |
|
|
- **GAGAvatar**: https://github.com/xg-chu/GAGAvatar |
|
|
- **GPAvatar**: https://github.com/xg-chu/GPAvatar |
|
|
- **FLAME**: https://flame.is.tue.mpg.de |
|
|
- **EMICA**: https://github.com/radekd91/inferno |
|
|
|
|
|
|
|
|
## Citation |
|
|
If you find our work useful in your research, please consider citing: |
|
|
```bibtex |
|
|
@misc{chu2025artalk,
|
|
title={ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model}, |
|
|
author={Xuangeng Chu and Nabarun Goswami and Ziteng Cui and Hanqin Wang and Tatsuya Harada}, |
|
|
year={2025}, |
|
|
eprint={2502.20323}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CV}, |
|
|
url={https://arxiv.org/abs/2502.20323}, |
|
|
} |
|
|
``` |