dual character with reference audio

#116
by jonnytracker - opened

Which workflow should I use to do dual character with dual reference audio?

You can prompt multiple characters, but I assume you want a consistent voice across multiple videos, or to use a particular voice ..

The easy alternative would be TTS: you have 2 reference audio clips, and prompt what each character should say. The TTS then feeds LTX an audio file matching the voice of your reference audio, cloning the voice.
https://huggingface.co/RuneXX/LTX-2.3-Workflows/tree/main/Talking-Avatar-TTS
OmniVoice and Fish Audio have dual character built into the workflow, but you can even add more: 3 or 4. (Qwen TTS can probably also do multi-speaker, but I think that workflow was made just for single voice clone; I'll update it with multi-character if it supports that.) I say easy since you can use it over and over, just prompting the TTS node what to say, using the same 2 reference audio files each time.

Alternatively, you can use the regular custom audio workflow, if you make the mp3 externally. This works if you already have the dialog on an mp3 and it won't need to be changed to other words inside Comfy.
https://huggingface.co/RuneXX/LTX-2.3-Workflows/tree/main/Custom-Audio
Say your mp3 has 2 people talking. Just prompt something like: First the woman talks, and she says "...". After that the man talks, and he says "...".
(Transcribing the audio within the "..." helps lip-sync movements, but you can skip that part and just write: First the woman talks, then the man talks. That usually works fine as well.)
But you would then need to remake another mp3 for a different dialog.
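If you're assembling that external dialog file yourself, here's a minimal stdlib-only sketch of stitching per-character clips into one track (WAV for simplicity; you'd convert to mp3 with ffmpeg or similar afterwards — the file names are just placeholders):

```python
import wave

def write_silence(path, seconds, rate=16000):
    """Stand-in for a real TTS/recorded clip: mono 16-bit silence."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))

def concat_wavs(paths, out_path):
    """Join WAV clips (same rate/channels/width) into one dialog track."""
    params, frames = None, []
    for p in paths:
        with wave.open(p, "rb") as w:
            if params is None:
                params = w.getparams()
            frames.append(w.readframes(w.getnframes()))
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        for f in frames:
            out.writeframes(f)

# Example: the woman's line followed by the man's line.
# concat_wavs(["woman_line1.wav", "man_line1.wav"], "dialog.wav")
```

Each new dialog is then just a re-run of the concat with different clips, which is the "remake another mp3" step.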

There is a 3rd alternative, and that's just doing 1 character at a time, since the 2 people probably aren't going to talk at the exact same time.
Use the ID-LoRA workflow: render the video where person 1 talks with one audio reference, then render other videos where the other person talks with a different audio reference. (They can both be in view, or it can be a single-person view visually, as long as only one person talks at a time.)

The benefit of this way, compared to custom audio or TTS, is that you also get LTX to do some ambient sound, background sound, etc.
With TTS and custom audio, that would need to be part of the audio file itself. (Though if you are using a video editor afterwards, having only voice and no background sound can also be beneficial.)

To get the idea, you can see an example here: https://huggingface.co/Kijai/LTX2.3_comfy/discussions/42#69c78aed0d0ec31cb3579d3d (even though I used Qwen TTS for that, it was a single audio reference, and one person at a time).
Many ways to Rome I guess ;-)

I've struggled with this and ended up doing the 1-character-at-a-time route, since they will step on each other's lines or both mouth the words at the same time. There's a mask involved with the custom audio workflows that I mean to explore: can it divide the screen in half, or could it take a SAM 3.1 dynamic mask for one person, then do a separate one for the other person, in one workflow?

Yes, having a mask might be a good idea actually, for TTS or any dual character setup with external audio.
Like you say, sometimes LTX gets confused about who is talking, or whether both are, and a mask could help with that.
SAM 3 is now natively supported in Comfy, so I might add that to the workflow ;-)
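For the screen-in-half idea, here's a quick stdlib-only sketch of generating left/right masks — written as binary PGM files you could convert to PNG for a mask input (a crude fallback compared to a real SAM mask, and the frame size is just an example):

```python
def half_mask_pgm(path, width, height, side="left"):
    """Write a binary PGM mask: white (255) over the speaking
    character's half of the frame, black (0) over the other half."""
    half = width // 2
    white, black = b"\xff", b"\x00"
    row = (white * half + black * (width - half)) if side == "left" \
        else (black * half + white * (width - half))
    with open(path, "wb") as f:
        f.write(b"P5\n%d %d\n255\n" % (width, height))  # PGM header
        f.write(row * height)                            # pixel data

# One mask per character, e.g. for a 1280x704 frame:
# half_mask_pgm("mask_left.pgm", 1280, 704, side="left")
# half_mask_pgm("mask_right.pgm", 1280, 704, side="right")
```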

I have had good luck with "insisting" on who is talking in the prompt, like: "The woman to the left in the red dress is talking. The man is silently listening." (But I haven't used it extensively enough to say whether that always works.)

Here's a couple. The first uses TTS for each character:
https://x.com/coach_bate/status/2049529582202528171?s=20

and this one uses just a prompt in LTX-2.3:
https://x.com/coach_bate/status/2031784429593772393?s=20

Both were a pain to create.

Those look great ;-) But yeah, I can imagine it took a few tries .. especially the 3-person one might have been tricky.
