---
license: mit
---

<h1 align="center">
ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model
</h1>

<h5 align="center">
<a href='https://arxiv.org/abs/2502.20323'>Paper</a> 
<a href='https://xg-chu.site/project_artalk/'>Project Website</a> 
<a href='https://github.com/xg-chu/ARTalk/'>Code</a>
</h5>

<h5 align="center">
<a href="https://xg-chu.site">Xuangeng Chu</a><sup>1</sup>,
<a href="https://naba89.github.io">Nabarun Goswami</a><sup>1</sup>,
<a href="https://cuiziteng.github.io">Ziteng Cui</a><sup>1</sup>,
<a href="https://openreview.net/profile?id=~Hanqin_Wang1">Hanqin Wang</a><sup>1</sup>,
<a href="https://www.mi.t.u-tokyo.ac.jp/harada/">Tatsuya Harada</a><sup>1,2</sup>
<br>
<sup>1</sup>The University of Tokyo,
<sup>2</sup>RIKEN AIP
</h5>

<div align="center">
<!-- <div align="center">
<b><img src="./demos/teaser.gif" alt="drawing" width="960"/></b>
</div> -->
<b>
ARTalk generates realistic 3D head motions (lip sync, blinking, expressions, head poses) from audio.
</b>
<br>
🔥 More results can be found on our <a href="https://xg-chu.site/project_artalk/">Project Page</a>. 🔥
</div>

<!-- ## TO DO
We are now preparing the <b>pre-trained model and quick start materials</b> and will release them within a week. -->
## Installation
### Clone the project
```
git clone --recurse-submodules git@github.com:xg-chu/ARTalk.git
cd ARTalk
```

### Build environment
I will prepare a new environment guide as soon as possible.

For now, please use GAGAvatar's `environment.yml` and install Gradio and the other required libraries.
```
conda env create -f environment.yml
conda activate ARTalk
```

<details>
<summary><span>Install GAGAvatar Module (if you want to use realistic avatars)</span></summary>

```
git clone --recurse-submodules git@github.com:xg-chu/diff-gaussian-rasterization.git
pip install ./diff-gaussian-rasterization
rm -rf ./diff-gaussian-rasterization
```

</details>

### Prepare resources
Prepare resources with:
```
bash ./build_resources.sh
```

## Quick Start Guide
### Using the <a href="https://github.com/gradio-app/gradio">Gradio</a> Interface

We provide a simple Gradio demo to demonstrate ARTalk's capabilities:
```
python inference.py --run_app
```

### Command Line Usage

ARTalk can also be used from the command line:
```
python inference.py -a your_audio_path --shape_id your_appearance --style_id your_style_motion --clip_length 750
```
`--shape_id` can be set to `mesh` or to a tracked real avatar stored in `tracked.pt`.

`--style_id` can be set to the name of a `*.pt` file stored in `assets/style_motion`.

`--clip_length` sets the maximum duration of the rendered video and can be adjusted as needed. Longer videos may take more time to render.
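For example, an illustrative invocation (the audio path and style name here are placeholders, not files shipped with the repo) would be `python inference.py -a ./my_speech.wav --shape_id mesh --style_id your_style --clip_length 750`.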
<details>
<summary><span>Track a new real head avatar and a new style motion</span></summary>

The file `tracked.pt` is generated using <a href="https://github.com/xg-chu/GAGAvatar/blob/main/inference.py">`GAGAvatar/inference.py`</a>. Here I've included several examples of tracked avatars for quick testing.

The style motion is tracked with the EMICA module in <a href="https://github.com/xg-chu/GAGAvatar_track">`GAGAvatar_track`</a>. Each style motion contains `50*106`-dimensional data: `50` is 2 seconds of consecutive frames, and `106` is `100` expression codes plus `6` pose codes (base + jaw). Here I've included several examples of tracked style motions.
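As a quick sanity check on this layout, here is a minimal sketch that loads a style motion file and splits each frame into expression and pose codes. It assumes the `*.pt` file holds a single `[50, 106]` tensor and uses a placeholder file name; the actual storage format in `assets/style_motion` may differ slightly.

```python
import torch

# Load a style motion clip (placeholder file name; pick any *.pt under assets/style_motion).
style = torch.load("assets/style_motion/example_style.pt", map_location="cpu")
print(style.shape)  # expected: torch.Size([50, 106]) -> 50 frames (2 s) x 106 codes

# Split each 106-dim frame code into its components.
expression = style[:, :100]  # 100-dim expression codes per frame
pose = style[:, 100:]        # 6-dim pose codes (base + jaw) per frame
print(expression.shape, pose.shape)  # torch.Size([50, 100]) torch.Size([50, 6])
```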
</details>

## Training

This version modifies the VQVAE part compared to the paper version.

The training code and the paper version code are still in preparation and are expected to be released later.

## Acknowledgements

We thank <a href="https://www.linkedin.com/in/lars-traaholt-vågnes-432725130/">Lars Traaholt Vågnes</a> and <a href="https://emmanueliarussi.github.io">Emmanuel Iarussi</a> from <a href="https://www.simli.com">Simli</a> for the insightful discussions! 🤗

The ARTalk logo was designed by Caihong Ning.

Part of our work is built on FLAME.
We also thank the following projects for sharing their great work.
- **GAGAvatar**: https://github.com/xg-chu/GAGAvatar
- **GPAvatar**: https://github.com/xg-chu/GPAvatar
- **FLAME**: https://flame.is.tue.mpg.de
- **EMICA**: https://github.com/radekd91/inferno

## Citation
If you find our work useful in your research, please consider citing:
```bibtex
@misc{chu2025artalk,
      title={ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model},
      author={Xuangeng Chu and Nabarun Goswami and Ziteng Cui and Hanqin Wang and Tatsuya Harada},
      year={2025},
      eprint={2502.20323},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2502.20323},
}
```