---
license: mit
---

# ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model

Paper · Project Website · Code
Xuangeng Chu¹, Nabarun Goswami¹, Ziteng Cui¹, Hanqin Wang¹, Tatsuya Harada¹˒²
1The University of Tokyo, 2RIKEN AIP
ARTalk generates realistic 3D head motions (lip sync, blinking, expressions, head poses) from audio.
🔥 More results can be found in our Project Page. 🔥
## Installation

### Clone the project

```
git clone --recurse-submodules git@github.com:xg-chu/ARTalk.git
cd ARTalk
```

### Build environment

A dedicated environment guide will be prepared as soon as possible. For now, please use GAGAvatar's `environment.yml` and install gradio and the other dependent libraries.

```
conda env create -f environment.yml
conda activate ARTalk
```
### Install the GAGAvatar module (if you want to use realistic avatars)

```
git clone --recurse-submodules git@github.com:xg-chu/diff-gaussian-rasterization.git
pip install ./diff-gaussian-rasterization
rm -rf ./diff-gaussian-rasterization
```
### Prepare resources

Prepare the required resources with:

```
bash ./build_resources.sh
```

## Quick Start Guide

### Using the Gradio Interface

We provide a simple Gradio demo to demonstrate ARTalk's capabilities:

```
python inference.py --run_app
```

### Command Line Usage

ARTalk can also be run from the command line:

```
python inference.py -a your_audio_path --shape_id your_appearance --style_id your_style_motion --clip_length 750
```

- `--shape_id` can be `mesh` or the name of a tracked real avatar stored in `tracked.pt`.
- `--style_id` can be the name of a `*.pt` file stored in `assets/style_motion`.
- `--clip_length` sets the maximum length of the rendered video and can be adjusted as needed. Longer videos take more time to render.
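Since the style-motion format described below stores 50 frames per 2 seconds of motion (i.e. 25 fps), a desired video duration can be converted into a `--clip_length` value. This is a small illustrative helper, not part of the ARTalk codebase; the 25 fps rate is inferred from that format, so treat it as an assumption:

```python
# Hypothetical helper: convert a target duration in seconds into a
# --clip_length value, assuming 25 frames per second (inferred from the
# style-motion format of 50 frames per 2 seconds).
def clip_length_for(seconds: float, fps: int = 25) -> int:
    return int(seconds * fps)

print(clip_length_for(30))  # 750, the value used in the example command
```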
### Track new real head avatars and style motions

The file `tracked.pt` is generated using `GAGAvatar/inference.py`; several examples of tracked avatars are included for quick testing. Style motions are tracked with the EMICA module in `GAGAvatar_track`. Each style motion contains `50*106`-dimensional data: `50` is 2 seconds of consecutive frames, and `106` is a `100`-dimensional expression code plus a `6`-dimensional pose code (base + jaw). Several examples of tracked style motions are also included.
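The layout above can be sketched as follows. This is only an illustration of the `50*106` split (the variable names are hypothetical, and the real files are PyTorch tensors loaded with `torch.load`; NumPy is used here as a stand-in):

```python
import numpy as np

# Stand-in for one tracked style motion: 50 frames (2 seconds) x 106 dims.
style_motion = np.zeros((50, 106))

expression = style_motion[:, :100]   # 100-dim expression code per frame
pose = style_motion[:, 100:]         # 6-dim pose code (base + jaw) per frame

print(expression.shape, pose.shape)  # (50, 100) (50, 6)
```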
## Training

This version modifies the VQVAE part compared to the paper version. The training code and the paper-version code are still in preparation and are expected to be released later.

## Acknowledgements

We thank Lars Traaholt Vågnes and Emmanuel Iarussi from Simli for the insightful discussions! 🤗
The ARTalk logo was designed by Caihong Ning.
Part of our work is built on FLAME. We also thank the following projects for sharing their great work.

- **GAGAvatar**: https://github.com/xg-chu/GAGAvatar
- **GPAvatar**: https://github.com/xg-chu/GPAvatar
- **FLAME**: https://flame.is.tue.mpg.de
- **EMICA**: https://github.com/radekd91/inferno

## Citation

If you find our work useful in your research, please consider citing:

```bibtex
@misc{chu2025artalk,
  title={ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model},
  author={Xuangeng Chu and Nabarun Goswami and Ziteng Cui and Hanqin Wang and Tatsuya Harada},
  year={2025},
  eprint={2502.20323},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.20323},
}
```