How to speed up inference?
I'm using Ditto Talking Head to power a real-time conversational avatar, and it's been working great so far. The main performance bottleneck I'm running into is diffusion inference. I'm already using the TensorRT build, which gave a big speedup, thank you for providing that.
I've experimented with:
- Reducing diffusion steps
- Lowering resolution
- Lowering frame rate
- Chunking / online inference (splitting longer motion generation into sequential chunks; see the sketch after this list)
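Here's roughly what my chunking looks like, as a minimal sketch: `generate_motion` is a stub standing in for the actual Ditto per-chunk call, and the chunk/overlap sizes are arbitrary placeholders.

```python
import numpy as np

def generate_motion(audio_window, sr=16000, fps=25):
    # Stub standing in for the real Ditto per-chunk call.
    # Returns one dummy motion frame per 1/fps seconds of audio.
    n_frames = int(round(len(audio_window) / sr * fps))
    return np.zeros((n_frames, 64), dtype=np.float32)  # 64 = dummy motion dim

def chunked_generate(audio, sr=16000, fps=25, chunk_s=2.0, overlap_s=0.2):
    """Split audio into fixed windows with a small left-side overlap for
    context, generate motion per window, then trim the overlapping frames."""
    chunk = int(chunk_s * sr)
    ov_samples = int(overlap_s * sr)
    ov_frames = int(overlap_s * fps)
    outputs, start = [], 0
    while start < len(audio):
        end = min(start + chunk, len(audio))
        window = audio[max(0, start - ov_samples):end]
        motion = generate_motion(window, sr=sr, fps=fps)
        # The first chunk has no preceding context, so keep all of its frames.
        outputs.append(motion if start == 0 else motion[ov_frames:])
        start = end
    return np.concatenate(outputs)

# Example: 10 s of silence at 16 kHz -> 250 frames at 25 fps.
print(chunked_generate(np.zeros(160000, dtype=np.float32)).shape)
```

This keeps per-chunk latency bounded, but each chunk still pays the full diffusion cost, so it helped responsiveness more than raw throughput.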
Does anyone have suggestions on how I can further improve inference speed? I'm open to some quality trade-offs.
Of the options above, only reducing the frame rate produced a meaningful latency improvement; reducing diffusion steps didn't have much effect, which makes me think I'm either hitting another bottleneck or not affecting the internals of the TRT engine the way I expect.
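To narrow that down, here's a minimal sketch of the kind of coarse per-stage timing I'd use to see where the time actually goes (the stage bodies are `time.sleep` placeholders for the real feature-extraction, diffusion, and rendering calls):

```python
import time
from contextlib import contextmanager

import torch

@contextmanager
def timed(label, totals):
    # GPU kernels launch asynchronously; synchronize so wall-clock
    # timing reflects the actual GPU work done in this stage.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    yield
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    totals[label] = totals.get(label, 0.0) + time.perf_counter() - t0

totals = {}
with timed("audio_features", totals):
    time.sleep(0.01)  # placeholder for the real feature-extraction call
with timed("diffusion", totals):
    time.sleep(0.05)  # placeholder for the real diffusion call
with timed("rendering", totals):
    time.sleep(0.02)  # placeholder for the real rendering/warping call
print({k: f"{v * 1000:.1f} ms" for k, v in totals.items()})
```

If diffusion turns out not to dominate, that would explain why cutting steps barely moves end-to-end latency.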
I would love to try INT8 quantization on the Ditto weights to push latency further, but I haven’t been able to get a successful build yet. If anyone has tips (or an example) for quantizing the diffusion component, I’d really appreciate it.
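For context, here's roughly the shape of the build I've been attempting, in case someone can spot what's off. This is a minimal sketch assuming an ONNX export of the diffusion net with a static input shape: the file names, input shape, and random calibration data are all placeholders, and a real calibrator should be fed representative audio/motion features. (The `trtexec --onnx=... --int8 --fp16 --saveEngine=...` CLI is a quicker way to smoke-test the same build.)

```python
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

class RandomCalibrator(trt.IInt8EntropyCalibrator2):
    """Entropy calibrator fed with random data, for build debugging only.
    Real INT8 quality requires representative calibration inputs."""
    def __init__(self, shape, num_batches=16):
        super().__init__()
        self.shape, self.num_batches, self.count = shape, num_batches, 0
        self.dmem = cuda.mem_alloc(int(np.prod(shape)) * 4)  # float32 bytes

    def get_batch_size(self):
        return self.shape[0]

    def get_batch(self, names):
        if self.count >= self.num_batches:
            return None  # signals that calibration is done
        batch = np.random.randn(*self.shape).astype(np.float32)
        cuda.memcpy_htod(self.dmem, batch)
        self.count += 1
        return [int(self.dmem)]

    def read_calibration_cache(self):
        return None

    def write_calibration_cache(self, cache):
        pass

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("diffusion.onnx", "rb") as f:  # placeholder: your actual export
    if not parser.parse(f.read()):
        raise RuntimeError(
            [parser.get_error(i) for i in range(parser.num_errors)])

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)  # allow per-layer fallback to FP16
config.int8_calibrator = RandomCalibrator(shape=(1, 80, 32))  # placeholder shape

engine = builder.build_serialized_network(network, config)
with open("diffusion_int8.engine", "wb") as f:
    f.write(engine)
```

Note that dynamic input shapes additionally need an optimization profile, and newer TensorRT versions steer you toward explicit Q/DQ quantization (a quantized ONNX produced with tools like NVIDIA's ModelOpt) instead of implicit calibration, so a version mismatch may be part of my build failures.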
If you'd like to see what I'm working on, here’s the project repo:
https://github.com/brucegarro/realtime-avatar