Workflow - I2V & T2V with ID-LoRA for consistent voice across video generations

#59
by RuneXX - opened

I2V & T2V Basic - consistent voice with ID-LoRA and reference audio

The workflow adds ID-LoRA, which lets you use a 5-second reference audio clip to get a consistent voice in every video you make:
https://id-lora.github.io/

Unlike custom spoken audio input, which strips the ambient sound away, with ID-LoRA you can prompt what the person should say, the background sound, etc.
All it needs is a 5-second reference clip, and you can then prompt any dialog based on that reference audio, giving you full flexibility.
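If you need to prepare a reference clip, here is a minimal sketch (plain Python stdlib only; the function name and file paths are hypothetical, not part of the workflow) that trims a WAV file down to its first 5 seconds:

```python
import wave

def trim_reference(in_path: str, out_path: str, seconds: float = 5.0) -> float:
    """Copy the first `seconds` of a WAV file to out_path; return output duration."""
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        # Don't read past the end of the file if it's shorter than `seconds`.
        n = min(int(seconds * rate), src.getnframes())
        frames = src.readframes(n)
    with wave.open(out_path, "wb") as dst:
        dst.setparams(params)  # wave fixes the frame count in the header on close
        dst.writeframes(frames)
    return n / rate
```

For other formats (mp3, flac) you'd need something like ffmpeg or torchaudio instead; this is only meant to show how short the reference needs to be.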

(the above video was run at the lowest strength, so you can increase the strength for even higher consistency)

Make sure ComfyUI is up to date; support for ID-LoRA was added recently.
Thanks to AviadDahan and the ID-LoRA team for the models, and thanks to Kijai for doing his magic ;-)

Download the LoRAs here:
https://huggingface.co/AviadDahan/LTX-2.3-ID-LoRA-CelebVHQ-3K
https://huggingface.co/AviadDahan/LTX-2.3-ID-LoRA-TalkVid-3K

(there might be some updates to the workflow; this was a first attempt, so hopefully it's mostly correct)

Thank you for your great efforts and amazing work... but can you explain the differences between the two LoRAs?

Not entirely sure, but I think the only difference is the dataset used. CelebV-HQ is a dataset (https://celebv-hq.github.io/), and so is TalkVid (https://github.com/FreedomIntelligence/TalkVid).
Both LoRAs work great though ;-)

Thanks for the workflow, but I do have a question: in the upsampling phase, you are NOT using the ID-LoRA, right? Only in the first pass? Is there a reason?

I tried it myself, but I am really struggling to get good audio consistency, and I cannot for the life of me decide whether it's better with or without the upsampler, or even without the LoRA; all versions have differing voices, to be honest. How different they are seems to depend more on the seed than on the LoRA strength or where it's applied, sadly.

Let alone face/identity consistency: any identity_guidance_scale > 0 basically destroys the scene, distorts faces, etc. :-/

The LoRA is only used in the first phase, yes.
And I'm not having any issues here; no distortions at all.
I'll take a look to see if it could be anything...

My reference audio is actually the audio generated by a previous LTX generation. It contains some background noise, and sometimes a little music playing, so it's not just the voice; maybe that's it. I will try with a clean voice-only clip. Thanks!

Ah yes, that might influence things; I haven't tried with anything other than clean vocal input. I'll add the MelBandRoformer nodes to the workflow.
These nodes extract the vocals only and remove everything else.
(Or you can try it yourself if you want: https://github.com/kijai/ComfyUI-MelBandRoFormer )
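MelBandRoFormer itself is a learned band-split model, so it does far more than this, but the underlying idea of keeping only the frequency content you care about and discarding the rest can be sketched crudely with an FFT band-pass (a toy illustration only, not what the node actually does):

```python
import numpy as np

def bandpass(signal: np.ndarray, rate: int, lo: float = 80.0, hi: float = 8000.0) -> np.ndarray:
    """Zero out frequency components outside [lo, hi] Hz (a crude voice band)."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / rate)
    spec[(freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(spec, n=len(signal))
```

A real vocal separator masks learned time-frequency bands per frame rather than applying one static filter, which is why it can remove music that overlaps the voice band; the sketch only shows the masking principle.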

Ah, using the vocals-only output of MelBandRoFormer as the reference_audio input to the ID-LoRA node definitely makes a difference. Many thanks for pointing that out!
