<h2 align='center'>Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis</h2>

<div align='center'>
<a href=""><strong>Tianqi Li</strong></a>
·
<a href=""><strong>Ruobing Zheng</strong></a><sup>†</sup>
·
<a href=""><strong>Minghui Yang</strong></a>
·
<a href=""><strong>Jingdong Chen</strong></a>
·
<a href=""><strong>Ming Yang</strong></a>
</div>
<div align='center'>
Ant Group
</div>
<br>
<div align='center'>
<a href='https://arxiv.org/abs/2411.19509'><img src='https://img.shields.io/badge/Paper-arXiv-red'></a>
<a href='https://digital-avatar.github.io/ai/Ditto/'><img src='https://img.shields.io/badge/Project-Page-blue'></a>
<a href='https://huggingface.co/digital-avatar/ditto-talkinghead'><img src='https://img.shields.io/badge/Model-HuggingFace-yellow'></a>
<a href='https://github.com/antgroup/ditto-talkinghead'><img src='https://img.shields.io/badge/Code-GitHub-purple'></a>
<!-- <a href='https://github.com/antgroup/ditto-talkinghead'><img src='https://img.shields.io/github/stars/antgroup/ditto-talkinghead?style=social'></a> -->
<a href='https://colab.research.google.com/drive/19SUi1TiO32IS-Crmsu9wrkNspWE8tFbs?usp=sharing'><img src='https://img.shields.io/badge/Demo-Colab-orange'></a>
</div>
<br>
<div align="center">
<video style="width: 95%; object-fit: cover;" controls loop src="https://github.com/user-attachments/assets/ef1a0b08-bff3-4997-a6dd-62a7f51cdb40" muted="false"></video>
<p>
✨ For more results, visit our <a href="https://digital-avatar.github.io/ai/Ditto/"><strong>Project Page</strong></a> ✨
</p>
</div>

## 📌 Updates

* [2025.11.12] 🔥🔥 We noticed the community's enthusiasm for open-source training code. The [training code](https://github.com/antgroup/ditto-talkinghead/tree/train) is now available.
* [2025.07.11] 🔥 The [PyTorch model](#-pytorch-model) is now available.
* [2025.07.07] 🔥 Ditto has been accepted by ACM MM 2025.
* [2025.01.21] 🔥 We updated the [Colab](https://colab.research.google.com/drive/19SUi1TiO32IS-Crmsu9wrkNspWE8tFbs?usp=sharing) demo; feel free to try it.
* [2025.01.10] 🔥 We released our inference [code](https://github.com/antgroup/ditto-talkinghead) and [models](https://huggingface.co/digital-avatar/ditto-talkinghead).
* [2024.11.29] 🔥 Our [paper](https://arxiv.org/abs/2411.19509) is now public on arXiv.

## 🛠️ Installation

Tested environment:
- System: CentOS 7.2
- GPU: A100
- Python: 3.10
- TensorRT: 8.6.1
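
To confirm your setup roughly matches, here is a quick optional check (it assumes an NVIDIA driver and Python are already on your `PATH`):

```bash
nvidia-smi         # shows GPU model and driver/CUDA version
python --version   # expect Python 3.10.x
```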

Clone the code from [GitHub](https://github.com/antgroup/ditto-talkinghead):
```bash
git clone https://github.com/antgroup/ditto-talkinghead
cd ditto-talkinghead
```

### Conda

Create the `conda` environment:
```bash
conda env create -f environment.yaml
conda activate ditto
```

### Pip

If you have trouble creating the conda environment, you can also refer to our [Colab](https://colab.research.google.com/drive/19SUi1TiO32IS-Crmsu9wrkNspWE8tFbs?usp=sharing).

After correctly installing `pytorch`, `cuda`, and `cudnn`, you only need to install a few additional packages with pip:
```bash
pip install \
    tensorrt==8.6.1 \
    librosa \
    tqdm \
    filetype \
    imageio \
    opencv_python_headless \
    scikit-image \
    cython \
    cuda-python \
    imageio-ffmpeg \
    colored \
    polygraphy \
    numpy==2.0.1
```

If you are not using `conda`, you may also need to install `ffmpeg`; see the [official website](https://www.ffmpeg.org/download.html).
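
As an optional sanity check of the installation (the expected TensorRT version follows the tested environment above):

```bash
python -c "import tensorrt; print(tensorrt.__version__)"      # expect 8.6.1
python -c "import librosa, cv2, imageio; print('imports ok')" # core runtime deps
ffmpeg -version | head -n 1                                   # confirm ffmpeg is on PATH
```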

## 📥 Download Checkpoints

Download the checkpoints from [HuggingFace](https://huggingface.co/digital-avatar/ditto-talkinghead) and put them in the `checkpoints` directory:
```bash
git lfs install
git clone https://huggingface.co/digital-avatar/ditto-talkinghead checkpoints
```

The `checkpoints` directory should look like this:
```text
./checkpoints/
├── ditto_cfg
│   ├── v0.4_hubert_cfg_trt.pkl
│   └── v0.4_hubert_cfg_trt_online.pkl
├── ditto_onnx
│   ├── appearance_extractor.onnx
│   ├── blaze_face.onnx
│   ├── decoder.onnx
│   ├── face_mesh.onnx
│   ├── hubert.onnx
│   ├── insightface_det.onnx
│   ├── landmark106.onnx
│   ├── landmark203.onnx
│   ├── libgrid_sample_3d_plugin.so
│   ├── lmdm_v0.4_hubert.onnx
│   ├── motion_extractor.onnx
│   ├── stitch_network.onnx
│   └── warp_network.onnx
└── ditto_trt_Ampere_Plus
    ├── appearance_extractor_fp16.engine
    ├── blaze_face_fp16.engine
    ├── decoder_fp16.engine
    ├── face_mesh_fp16.engine
    ├── hubert_fp32.engine
    ├── insightface_det_fp16.engine
    ├── landmark106_fp16.engine
    ├── landmark203_fp16.engine
    ├── lmdm_v0.4_hubert_fp32.engine
    ├── motion_extractor_fp32.engine
    ├── stitch_network_fp16.engine
    └── warp_network_fp16.engine
```
- `ditto_cfg/v0.4_hubert_cfg_trt_online.pkl` is the online config.
- `ditto_cfg/v0.4_hubert_cfg_trt.pkl` is the offline config.

## 🚀 Inference

Run `inference.py`:
```shell
python inference.py \
    --data_root "<path-to-trt-model>" \
    --cfg_pkl "<path-to-cfg-pkl>" \
    --audio_path "<path-to-input-audio>" \
    --source_path "<path-to-input-image>" \
    --output_path "<path-to-output-mp4>"
```

For example:
```shell
python inference.py \
    --data_root "./checkpoints/ditto_trt_Ampere_Plus" \
    --cfg_pkl "./checkpoints/ditto_cfg/v0.4_hubert_cfg_trt.pkl" \
    --audio_path "./example/audio.wav" \
    --source_path "./example/image.png" \
    --output_path "./tmp/result.mp4"
```
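
The CLI processes one audio/image pair per call, so batch generation is just a shell loop. A minimal sketch, assuming your clips live in a hypothetical `./example/audios/` directory:

```bash
mkdir -p ./tmp
# Generate one video per .wav file (./example/audios/ is an assumed layout; adjust to yours)
for audio in ./example/audios/*.wav; do
    name=$(basename "$audio" .wav)
    python inference.py \
        --data_root "./checkpoints/ditto_trt_Ampere_Plus" \
        --cfg_pkl "./checkpoints/ditto_cfg/v0.4_hubert_cfg_trt.pkl" \
        --audio_path "$audio" \
        --source_path "./example/image.png" \
        --output_path "./tmp/${name}.mp4"
done
```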
| βNote: | |
| We have provided the tensorRT model with `hardware-compatibility-level=Ampere_Plus` (`checkpoints/ditto_trt_Ampere_Plus/`). If your GPU does not support it, please execute the `cvt_onnx_to_trt.py` script to convert from the general onnx model (`checkpoints/ditto_onnx/`) to the tensorRT model. | |
| ```bash | |
| python scripts/cvt_onnx_to_trt.py --onnx_dir "./checkpoints/ditto_onnx" --trt_dir "./checkpoints/ditto_trt_custom" | |
| ``` | |
| Then run `inference.py` with `--data_root=./checkpoints/ditto_trt_custom`. | |
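
One quick way to check whether the prebuilt engines can run on your card is to query its CUDA compute capability (Ampere and newer report 8.0 or higher); this sketch assumes a CUDA-enabled PyTorch is already installed:

```bash
# (8, 0) or higher means the Ampere_Plus engines should load; otherwise rebuild from ONNX as above
python -c "import torch; print(torch.cuda.get_device_capability(0))"
```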

## ⚡ PyTorch Model

*Based on community interest and to better support further development, we are open-sourcing the PyTorch version of the model.*

We have added the PyTorch model and its corresponding configuration files to the [HuggingFace repository](https://huggingface.co/digital-avatar/ditto-talkinghead). Please refer to [Download Checkpoints](#-download-checkpoints) to prepare the model files.

The `checkpoints` directory should now also contain:
```text
./checkpoints/
├── ditto_cfg
│   ├── ...
│   └── v0.4_hubert_cfg_pytorch.pkl
├── ...
└── ditto_pytorch
    ├── aux_models
    │   ├── 2d106det.onnx
    │   ├── det_10g.onnx
    │   ├── face_landmarker.task
    │   ├── hubert_streaming_fix_kv.onnx
    │   └── landmark203.onnx
    └── models
        ├── appearance_extractor.pth
        ├── decoder.pth
        ├── lmdm_v0.4_hubert.pth
        ├── motion_extractor.pth
        ├── stitch_network.pth
        └── warp_network.pth
```

To run inference with the PyTorch model, execute:
```shell
python inference.py \
    --data_root "./checkpoints/ditto_pytorch" \
    --cfg_pkl "./checkpoints/ditto_cfg/v0.4_hubert_cfg_pytorch.pkl" \
    --audio_path "./example/audio.wav" \
    --source_path "./example/image.png" \
    --output_path "./tmp/result.mp4"
```

## 🙏 Acknowledgement

Our implementation builds on [S2G-MDDiffusion](https://github.com/thuhcsi/S2G-MDDiffusion) and [LivePortrait](https://github.com/KwaiVGI/LivePortrait). Thanks for their remarkable contributions and released code! If we have missed any open-source projects or related articles, we will gladly update the acknowledgements.

## ⚖️ License

This repository is released under the Apache-2.0 license as found in the [LICENSE](LICENSE) file.

## 📚 Citation

If you find this codebase useful for your research, please cite it using the following entry:
```BibTeX
@article{li2024ditto,
  title={Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis},
  author={Li, Tianqi and Zheng, Ruobing and Yang, Minghui and Chen, Jingdong and Yang, Ming},
  journal={arXiv preprint arXiv:2411.19509},
  year={2024}
}
```

## 🌟 Star History

[![Star History Chart](https://api.star-history.com/svg?repos=antgroup/ditto-talkinghead&type=Date)](https://www.star-history.com/#antgroup/ditto-talkinghead&Date)