How to speed up inference?
I'm using Ditto Talking Head to power a real-time conversational avatar, and it's been working great so far. The main performance bottleneck I'm running into is diffusion inference. I'm already using the TensorRT build, which gave a big speedup, thank you for providing that.
I've experimented with:
- Reducing diffusion steps
- Lowering resolution
- Lowering frame rate
- Chunking / online inference (splitting longer motion generation into sequential chunks; see the sketch after this list)
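Here's roughly what my chunking looks like, as a minimal sketch: `generate_motion` is a stub standing in for the actual Ditto per-chunk call, and the chunk/overlap sizes are arbitrary placeholders.

```python
import numpy as np

def generate_motion(audio_window, sr=16000, fps=25):
    # Stub standing in for the real Ditto per-chunk call.
    # Returns one dummy motion frame per 1/fps seconds of audio.
    n_frames = int(round(len(audio_window) / sr * fps))
    return np.zeros((n_frames, 64), dtype=np.float32)  # 64 = dummy motion dim

def chunked_generate(audio, sr=16000, fps=25, chunk_s=2.0, overlap_s=0.2):
    """Split audio into fixed windows with a small left-side overlap for
    context, generate motion per window, then trim the overlapping frames."""
    chunk = int(chunk_s * sr)
    ov_samples = int(overlap_s * sr)
    ov_frames = int(overlap_s * fps)
    outputs, start = [], 0
    while start < len(audio):
        end = min(start + chunk, len(audio))
        window = audio[max(0, start - ov_samples):end]
        motion = generate_motion(window, sr=sr, fps=fps)
        # The first chunk has no preceding context, so keep all of its frames.
        outputs.append(motion if start == 0 else motion[ov_frames:])
        start = end
    return np.concatenate(outputs)

# Example: 10 s of silence at 16 kHz -> 250 frames at 25 fps.
print(chunked_generate(np.zeros(160000, dtype=np.float32)).shape)
```

This keeps per-chunk latency bounded, but each chunk still pays the full diffusion cost, so it helped responsiveness more than raw throughput.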
Does anyone have suggestions on how I can further improve inference speed? I'm open to some quality trade-offs.
Of the options above, only reducing the frame rate produced a meaningful latency improvement; reducing diffusion steps didn't have much effect, which makes me think I'm either hitting another bottleneck or not affecting the internals of the TRT engine the way I expect.
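To narrow that down, here's a minimal sketch of the kind of coarse per-stage timing I'd use to see where the time actually goes (the stage bodies are `time.sleep` placeholders for the real feature-extraction, diffusion, and rendering calls):

```python
import time
from contextlib import contextmanager

import torch

@contextmanager
def timed(label, totals):
    # GPU kernels launch asynchronously; synchronize so wall-clock
    # timing reflects the actual GPU work done in this stage.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    yield
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    totals[label] = totals.get(label, 0.0) + time.perf_counter() - t0

totals = {}
with timed("audio_features", totals):
    time.sleep(0.01)  # placeholder for the real feature-extraction call
with timed("diffusion", totals):
    time.sleep(0.05)  # placeholder for the real diffusion call
with timed("rendering", totals):
    time.sleep(0.02)  # placeholder for the real rendering/warping call
print({k: f"{v * 1000:.1f} ms" for k, v in totals.items()})
```

If diffusion turns out not to dominate, that would explain why cutting steps barely moves end-to-end latency.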
I would love to try INT8 quantization on the Ditto weights to push latency further, but I haven’t been able to get a successful build yet. If anyone has tips (or an example) for quantizing the diffusion component, I’d really appreciate it.
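For context, here's roughly the shape of the build I've been attempting, in case someone can spot what's off. This is a minimal sketch assuming an ONNX export of the diffusion net with a static input shape: the file names, input shape, and random calibration data are all placeholders, and a real calibrator should be fed representative audio/motion features. (The `trtexec --onnx=... --int8 --fp16 --saveEngine=...` CLI is a quicker way to smoke-test the same build.)

```python
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

class RandomCalibrator(trt.IInt8EntropyCalibrator2):
    """Entropy calibrator fed with random data, for build debugging only.
    Real INT8 quality requires representative calibration inputs."""
    def __init__(self, shape, num_batches=16):
        super().__init__()
        self.shape, self.num_batches, self.count = shape, num_batches, 0
        self.dmem = cuda.mem_alloc(int(np.prod(shape)) * 4)  # float32 bytes

    def get_batch_size(self):
        return self.shape[0]

    def get_batch(self, names):
        if self.count >= self.num_batches:
            return None  # signals that calibration is done
        batch = np.random.randn(*self.shape).astype(np.float32)
        cuda.memcpy_htod(self.dmem, batch)
        self.count += 1
        return [int(self.dmem)]

    def read_calibration_cache(self):
        return None

    def write_calibration_cache(self, cache):
        pass

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("diffusion.onnx", "rb") as f:  # placeholder: your actual export
    if not parser.parse(f.read()):
        raise RuntimeError(
            [parser.get_error(i) for i in range(parser.num_errors)])

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
config.set_flag(trt.BuilderFlag.FP16)  # allow per-layer fallback to FP16
config.int8_calibrator = RandomCalibrator(shape=(1, 80, 32))  # placeholder shape

engine = builder.build_serialized_network(network, config)
with open("diffusion_int8.engine", "wb") as f:
    f.write(engine)
```

Note that dynamic input shapes additionally need an optimization profile, and newer TensorRT versions steer you toward explicit Q/DQ quantization (a quantized ONNX produced with tools like NVIDIA's ModelOpt) instead of implicit calibration, so a version mismatch may be part of my build failures.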
If you'd like to see what I'm working on, here’s the project repo:
https://github.com/brucegarro/realtime-avatar