<h2 align='center'>Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis</h2>
<div align='center'>
<a href=""><strong>Tianqi Li</strong></a>
·
<a href=""><strong>Ruobing Zheng</strong></a><sup>†</sup>
·
<a href=""><strong>Minghui Yang</strong></a>
·
<a href=""><strong>Jingdong Chen</strong></a>
·
<a href=""><strong>Ming Yang</strong></a>
</div>
<div align='center'>
Ant Group
</div>
<br>
<div align='center'>
<a href='https://arxiv.org/abs/2411.19509'><img src='https://img.shields.io/badge/Paper-arXiv-red'></a>
<a href='https://digital-avatar.github.io/ai/Ditto/'><img src='https://img.shields.io/badge/Project-Page-blue'></a>
<a href='https://huggingface.co/digital-avatar/ditto-talkinghead'><img src='https://img.shields.io/badge/Model-HuggingFace-yellow'></a>
<a href='https://github.com/antgroup/ditto-talkinghead'><img src='https://img.shields.io/badge/Code-GitHub-purple'></a>
<!-- <a href='https://github.com/antgroup/ditto-talkinghead'><img src='https://img.shields.io/github/stars/antgroup/ditto-talkinghead?style=social'></a> -->
<a href='https://colab.research.google.com/drive/19SUi1TiO32IS-Crmsu9wrkNspWE8tFbs?usp=sharing'><img src='https://img.shields.io/badge/Demo-Colab-orange'></a>
</div>
<br>
<div align="center">
<video style="width: 95%; object-fit: cover;" controls loop src="https://github.com/user-attachments/assets/ef1a0b08-bff3-4997-a6dd-62a7f51cdb40" muted="false"></video>
<p>
✨ For more results, visit our <a href="https://digital-avatar.github.io/ai/Ditto/"><strong>Project Page</strong></a> ✨
</p>
</div>
## 📌 Updates
* [2025.11.12] 🔥🔥 In response to the community's enthusiasm for open-source training code, the [training code](https://github.com/antgroup/ditto-talkinghead/tree/train) is now available.
* [2025.07.11] 🔥 The [PyTorch model](#-pytorch-model) is now available.
* [2025.07.07] 🔥 Ditto has been accepted by ACM MM 2025.
* [2025.01.21] 🔥 We updated the [Colab](https://colab.research.google.com/drive/19SUi1TiO32IS-Crmsu9wrkNspWE8tFbs?usp=sharing) demo; feel free to try it.
* [2025.01.10] 🔥 We released our inference [code](https://github.com/antgroup/ditto-talkinghead) and [models](https://huggingface.co/digital-avatar/ditto-talkinghead).
* [2024.11.29] 🔥 Our [paper](https://arxiv.org/abs/2411.19509) is now publicly available on arXiv.
## 🛠️ Installation
Tested environment:
- System: CentOS 7.2
- GPU: A100
- Python: 3.10
- TensorRT: 8.6.1
Clone the codes from [GitHub](https://github.com/antgroup/ditto-talkinghead):
```bash
git clone https://github.com/antgroup/ditto-talkinghead
cd ditto-talkinghead
```
### Conda
Create `conda` environment:
```bash
conda env create -f environment.yaml
conda activate ditto
```
### Pip
If you have problems creating a conda environment, you can also refer to our [Colab](https://colab.research.google.com/drive/19SUi1TiO32IS-Crmsu9wrkNspWE8tFbs?usp=sharing).
After correctly installing PyTorch, CUDA, and cuDNN, you only need to install a few packages using pip:
```bash
pip install \
    tensorrt==8.6.1 \
    librosa \
    tqdm \
    filetype \
    imageio \
    opencv_python_headless \
    scikit-image \
    cython \
    cuda-python \
    imageio-ffmpeg \
    colored \
    polygraphy \
    numpy==2.0.1
```
If you don't use `conda`, you may also need to install `ffmpeg` according to the [official website](https://www.ffmpeg.org/download.html).
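After installing, a quick sanity check can confirm that the required packages are importable. This is a minimal sketch: the list mirrors the `pip install` above, mapping pip package names to import names where they differ (e.g. `opencv_python_headless` imports as `cv2`).

```python
import importlib.util

# Import names for the packages installed above. Several pip package
# names differ from their import names (opencv_python_headless -> cv2,
# scikit-image -> skimage, cuda-python -> cuda, cython -> Cython).
REQUIRED = [
    "tensorrt", "librosa", "tqdm", "filetype", "imageio", "cv2",
    "skimage", "Cython", "cuda", "imageio_ffmpeg", "colored",
    "polygraphy", "numpy",
]

def missing_packages(names):
    """Return the subset of `names` that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required packages found.")
```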
## 📥 Download Checkpoints
Download the checkpoints from [HuggingFace](https://huggingface.co/digital-avatar/ditto-talkinghead) and put them in the `checkpoints` directory:
```bash
git lfs install
git clone https://huggingface.co/digital-avatar/ditto-talkinghead checkpoints
```
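If the clone finishes suspiciously fast, the large files may still be Git LFS pointer stubs rather than the real weights (this happens when `git lfs install` was skipped). A small sketch to detect that, based on the standard LFS pointer-file header:

```python
# Git LFS pointer files are short text stubs that start with this line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

def is_lfs_pointer(path):
    """True if `path` is an un-smudged Git LFS pointer stub
    rather than the actual binary file."""
    with open(path, "rb") as f:
        return f.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
```

For example, `is_lfs_pointer("checkpoints/ditto_onnx/hubert.onnx")` should return `False` after a successful LFS checkout.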
The `checkpoints` directory should look like this:
```text
./checkpoints/
├── ditto_cfg
│   ├── v0.4_hubert_cfg_trt.pkl
│   └── v0.4_hubert_cfg_trt_online.pkl
├── ditto_onnx
│   ├── appearance_extractor.onnx
│   ├── blaze_face.onnx
│   ├── decoder.onnx
│   ├── face_mesh.onnx
│   ├── hubert.onnx
│   ├── insightface_det.onnx
│   ├── landmark106.onnx
│   ├── landmark203.onnx
│   ├── libgrid_sample_3d_plugin.so
│   ├── lmdm_v0.4_hubert.onnx
│   ├── motion_extractor.onnx
│   ├── stitch_network.onnx
│   └── warp_network.onnx
└── ditto_trt_Ampere_Plus
    ├── appearance_extractor_fp16.engine
    ├── blaze_face_fp16.engine
    ├── decoder_fp16.engine
    ├── face_mesh_fp16.engine
    ├── hubert_fp32.engine
    ├── insightface_det_fp16.engine
    ├── landmark106_fp16.engine
    ├── landmark203_fp16.engine
    ├── lmdm_v0.4_hubert_fp32.engine
    ├── motion_extractor_fp32.engine
    ├── stitch_network_fp16.engine
    └── warp_network_fp16.engine
```
- `ditto_cfg/v0.4_hubert_cfg_trt_online.pkl` is the config for online (streaming) inference.
- `ditto_cfg/v0.4_hubert_cfg_trt.pkl` is the config for offline inference.
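Before running inference, you can verify that the expected files landed in place. This is a sketch: `EXPECTED` lists only a representative subset of the tree above and can be extended.

```python
from pathlib import Path

# A few representative files from the checkpoint tree above; extend as needed.
EXPECTED = [
    "ditto_cfg/v0.4_hubert_cfg_trt.pkl",
    "ditto_cfg/v0.4_hubert_cfg_trt_online.pkl",
    "ditto_onnx/hubert.onnx",
    "ditto_trt_Ampere_Plus/hubert_fp32.engine",
]

def missing_checkpoints(root="./checkpoints"):
    """Return the expected checkpoint files that are absent under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]

if __name__ == "__main__":
    missing = missing_checkpoints()
    print("OK" if not missing else f"Missing files: {missing}")
```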
## 🚀 Inference
Run `inference.py`:
```shell
python inference.py \
    --data_root "<path-to-trt-model>" \
    --cfg_pkl "<path-to-cfg-pkl>" \
    --audio_path "<path-to-input-audio>" \
    --source_path "<path-to-input-image>" \
    --output_path "<path-to-output-mp4>"
```
For example:
```shell
python inference.py \
    --data_root "./checkpoints/ditto_trt_Ampere_Plus" \
    --cfg_pkl "./checkpoints/ditto_cfg/v0.4_hubert_cfg_trt.pkl" \
    --audio_path "./example/audio.wav" \
    --source_path "./example/image.png" \
    --output_path "./tmp/result.mp4"
```
❗Note:
We provide TensorRT engines built with `hardware-compatibility-level=Ampere_Plus` (`checkpoints/ditto_trt_Ampere_Plus/`). If your GPU does not support this compatibility level, run the `cvt_onnx_to_trt.py` script to convert the general ONNX models (`checkpoints/ditto_onnx/`) into TensorRT engines for your hardware:
```bash
python scripts/cvt_onnx_to_trt.py --onnx_dir "./checkpoints/ditto_onnx" --trt_dir "./checkpoints/ditto_trt_custom"
```
Then run `inference.py` with `--data_root=./checkpoints/ditto_trt_custom`.
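The `Ampere_Plus` compatibility level covers Ampere-or-newer GPUs, i.e. CUDA compute capability 8.0 and up (e.g. A100 is 8.0, RTX 30-series is 8.6). A small helper to decide whether you need the conversion step; this is a sketch, and the commented `torch` query is one way to obtain the capability on your machine.

```python
def supports_ampere_plus(major: int, minor: int) -> bool:
    """True if a GPU with CUDA compute capability (major, minor) is
    Ampere or newer (compute capability 8.0+), which is what the
    prebuilt Ampere_Plus engines target."""
    return (major, minor) >= (8, 0)

# One way to query the capability on a machine with PyTorch + CUDA:
#   import torch
#   major, minor = torch.cuda.get_device_capability(0)
#   use_prebuilt = supports_ampere_plus(major, minor)
```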
## ⚡ PyTorch Model
*Based on community interest and to better support further development, we are now open-sourcing the PyTorch version of the model.*
We have added the PyTorch model and corresponding configuration files to our [HuggingFace](https://huggingface.co/digital-avatar/ditto-talkinghead) repository. Please refer to [Download Checkpoints](#-download-checkpoints) to prepare the model files.
The `checkpoints` directory should look like this:
```text
./checkpoints/
├── ditto_cfg
│   ├── ...
│   └── v0.4_hubert_cfg_pytorch.pkl
├── ...
└── ditto_pytorch
    ├── aux_models
    │   ├── 2d106det.onnx
    │   ├── det_10g.onnx
    │   ├── face_landmarker.task
    │   ├── hubert_streaming_fix_kv.onnx
    │   └── landmark203.onnx
    └── models
        ├── appearance_extractor.pth
        ├── decoder.pth
        ├── lmdm_v0.4_hubert.pth
        ├── motion_extractor.pth
        ├── stitch_network.pth
        └── warp_network.pth
```
To run inference, execute the following command:
```shell
python inference.py \
    --data_root "./checkpoints/ditto_pytorch" \
    --cfg_pkl "./checkpoints/ditto_cfg/v0.4_hubert_cfg_pytorch.pkl" \
    --audio_path "./example/audio.wav" \
    --source_path "./example/image.png" \
    --output_path "./tmp/result.mp4"
```
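If you want to animate the same portrait with several audio clips, a simple batch driver can wrap `inference.py`. This is a sketch: `./example/audios/` is a hypothetical directory, and the checkpoint paths follow the PyTorch example above.

```python
import subprocess
from pathlib import Path

AUDIO_DIR = Path("./example/audios")  # hypothetical input directory
OUT_DIR = Path("./tmp")
OUT_DIR.mkdir(parents=True, exist_ok=True)

# One output video per .wav clip, named after the clip.
for audio in sorted(AUDIO_DIR.glob("*.wav")):
    subprocess.run([
        "python", "inference.py",
        "--data_root", "./checkpoints/ditto_pytorch",
        "--cfg_pkl", "./checkpoints/ditto_cfg/v0.4_hubert_cfg_pytorch.pkl",
        "--audio_path", str(audio),
        "--source_path", "./example/image.png",
        "--output_path", str(OUT_DIR / f"{audio.stem}.mp4"),
    ], check=True)  # raise if any single run fails
```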
## 📧 Acknowledgement
Our implementation is based on [S2G-MDDiffusion](https://github.com/thuhcsi/S2G-MDDiffusion) and [LivePortrait](https://github.com/KwaiVGI/LivePortrait). Thanks for their remarkable contributions and released code! If we have missed any open-source projects or related articles, please let us know and we will update the acknowledgements promptly.
## ⚖️ License
This repository is released under the Apache-2.0 license as found in the [LICENSE](LICENSE) file.
## 📚 Citation
If you find this codebase useful for your research, please cite our work using the following entry.
```BibTeX
@article{li2024ditto,
  title={Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis},
  author={Li, Tianqi and Zheng, Ruobing and Yang, Minghui and Chen, Jingdong and Yang, Ming},
  journal={arXiv preprint arXiv:2411.19509},
  year={2024}
}
```
## 🌟 Star History
[![Star History Chart](https://api.star-history.com/svg?repos=antgroup/ditto-talkinghead&type=Date)](https://www.star-history.com/#antgroup/ditto-talkinghead&Date)