RUNEXX, question about the LIPSYNC workflow

#60
by Liquidmind111 - opened

If I am doing 10-second animations, and the song's total length is 1 minute, and I cut the audio into 6 segments of 10 seconds, do I need to PROMPT each of the 6 scenes with the WORDS the person is supposed to SING in that specific scene? Or what?

You don't have to prompt; LTX seems to be quite able to lip-sync and "hear" the audio.
But it can help if you run into something that doesn't quite look right.
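If you want to do the cutting step itself programmatically, here is a minimal sketch using only Python's standard `wave` module. The filenames and the `split_wav` helper are just illustrative, not part of any LTX workflow:

```python
import wave

def split_wav(src: str, seconds: int = 10) -> list[str]:
    """Cut a WAV file into fixed-length segments (the last one may be shorter)."""
    outputs = []
    with wave.open(src, "rb") as r:
        frames_per_segment = r.getframerate() * seconds
        index = 0
        while True:
            frames = r.readframes(frames_per_segment)
            if not frames:
                break  # no audio left
            name = f"segment_{index:03d}.wav"
            with wave.open(name, "wb") as w:
                # copy the source format onto each segment
                w.setnchannels(r.getnchannels())
                w.setsampwidth(r.getsampwidth())
                w.setframerate(r.getframerate())
                w.writeframes(frames)
            outputs.append(name)
            index += 1
    return outputs
```

A 1-minute song split with `split_wav("song.wav", seconds=10)` would give you six segment files to feed into the six scenes.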

Here's some other guy posting a music video; might be good inspiration:
https://huggingface.co/RuneXX/LTX-2.3-Workflows/discussions/61

Hello, please tell me: what if LTX refuses to do a lip sync? It only works on this model: ltx-2.3-22b-distilled_transformer_only_fp8_input_scaled_v3.safetensors, and even then only sometimes. If you give it the dev model (fp16) and enable the distilled lora, the lip sync disappears. It disappears in a strange way: I2V works if you give it time tags and the exact lines at each timestamp (and only on scaled_v3); on any other model, the character moves as if it's talking, but the mouth doesn't move. Even on the scaled_v3 model, if you use FLF2V, it doesn't always work.

Can you tell me how to make it lip sync to custom audio? Maybe I'm doing something wrong. Here's my prompt:

[0:00-0:05]: Vampire woman talking to camera, she says in Russian: "Сегодня мы поговорим о столетней войне" ("Today we will talk about the Hundred Years' War"). Close-up of the woman's face (middle frame view). She stands by the bookshelf, then begins walking confidently forward in high heels. Background: blurred books.
[0:05-0:09]: Vampire woman to camera, she says in Russian: "О том, как вы, люди, почти целый век самозабвенно резали друг друга ради эфемерных целей и чужих идеалов" ("About how you humans spent almost a whole century devotedly slaughtering each other for ephemeral goals and other people's ideals"). The woman slowly walks through a gothic corridor. The camera moves backward in front of her, tightly framing her face and upper body in the white fur stole. In the background, walls with candelabras and paintings pass by.
[0:09-0:18]: Vampire woman to camera, she says in Russian: "И о том, как вы, доведя свою глупость до абсолюта, умудрились сжечь на костре единственную девушку, которая стала знаменем мира" ("And about how, taking your stupidity to the absolute, you managed to burn at the stake the only girl who became a banner of peace"). The woman enters the living room with a fireplace and smoothly sits down in a luxurious armchair. As she settles into the chair, the camera slowly pulls back (dolly out), revealing the full scene: the burning fireplace, gothic window, and her figure in the armchair.

Lip-sync with custom audio

I upload an audio track and the lip sync is either not there at all, or it's there at the beginning but disappears by the end of the video. I also give it 2 more frames, the initial one and the final one. If I disable the custom audio, she speaks, but I need my own voice.

Lip-sync can be tricky sometimes. I'm not sure where the "magic bullet" is, but some image inputs seem to invite a narrator voice-over type of video.
I've also seen some say that if the sound is mono and "center" (i.e., it sounds like a narrator), it can help to convert it to stereo.
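If you want to try that mono-to-stereo trick, here is a minimal sketch with Python's standard `wave` module that simply duplicates the single channel into left and right. The filenames and the `mono_to_stereo` helper are placeholders, not part of any workflow:

```python
import wave

def mono_to_stereo(src: str, dst: str) -> None:
    """Duplicate a mono WAV's single channel into left and right channels."""
    with wave.open(src, "rb") as r:
        if r.getnchannels() != 1:
            raise ValueError("expected a mono input file")
        width = r.getsampwidth()
        rate = r.getframerate()
        frames = r.readframes(r.getnframes())
    # Interleave each sample with a copy of itself: L R L R ...
    stereo = b"".join(frames[i:i + width] * 2 for i in range(0, len(frames), width))
    with wave.open(dst, "wb") as w:
        w.setnchannels(2)
        w.setsampwidth(width)
        w.setframerate(rate)
        w.writeframes(stereo)
```

The result sounds identical, but the file is technically stereo, which is the point of the trick.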

As for myself, I usually double down on the prompting when I get a stubborn one that refuses ;-) such as: "and then she talks, and she says: "....". With expressive face and lip movements as she speaks the words". In other words, "insisting" she talks by saying so in multiple ways: "and then she talks and she says .... with her lip movements in perfect sync with the audio", and similar. That's when I get a stubborn one, that is; often it's not needed to be this explicit.

It can also help to be very specific about who is talking, so the model really knows who you mean and doesn't resort to a voice-over as the "easier choice". Such as: "The woman in the black uniform with a hat talks, and she says..." or "the woman on the left looks at the viewer and talks, she says...".

And lastly, there is a lora that can help when you're really, really stuck... but it's made for LTX-2.0. It should still work, but for the older loras you might have to increase the strength to, say, 1.5 or even 2.
https://huggingface.co/MachineDelusions/LTX-2_Image2Video_Adapter_LoRa

I don't think it matters which model you use, though. But I haven't done any 1-to-1 testing, so I can't be sure about that ;-) It would be a little strange if it mattered.

And to add, since you said the Dev model was the most troublesome: it might help to increase the steps.

I just did many lip-sync ones myself today, with the Dev model, and didn't have any issues. (I was testing the new ID Lora, which can use audio as a short reference input instead of custom audio: you just prompt what she says, and the voice sounds like your short 5-second reference audio.)
https://huggingface.co/RuneXX/LTX-2.3-Workflows/discussions/59

Thank you. The sound generated by ID Lora is better than the standard sound, but it still doesn't match the studio sound of a real person. If you're not making a meme or a funny TikTok with a cartoon character and it's not in English, it's not suitable. That's why I'm struggling with the custom sound.

if it's not in English, it's not suitable. That's why I'm struggling with the custom sound.

Try the TalkVid version. Support for other languages likely depends on the training data.

For https://celebv-hq.github.io/ I don't see any mention of languages, and since it's based on clips of celebrities, I guess they're western ones that speak English.
But that's just my guess; I don't really know, since the dataset page doesn't say which languages.

The TalkVid dataset has a few languages mentioned: https://github.com/FreedomIntelligence/TalkVid (covers 15 languages: English, Chinese, Arabic, Polish, German, Russian, French, Korean, Portuguese, Japanese, Thai, Spanish, Italian, Hindi)

For the TalkVid lora : https://huggingface.co/AviadDahan/LTX-2.3-ID-LoRA-TalkVid-3K

if it's not in English, it's not suitable. That's why I'm struggling with the custom sound.

I am making a workflow that uses TTS. It lets you generate the audio first, from a prompt of what to say, and this is then given to LTX as an audio clip input.

This works great for sure; it's the more "old school" way of doing it (it was supported in LTX-2.0 as well, and I had a Qwen-TTS workflow example for that). And it does not use any lora.

It works with any TTS, but the ones I might make workflows for are Qwen-TTS, FishAudio 2 Pro, and Microsoft VibeVoice. Maybe IndexTTS as well.
But it's super easy to change the workflow to your own preferred TTS nodes.

Fish Audio S2 Pro is the most advanced multimodal model developed by Fish Audio, trained on over 10 million hours of audio data covering more than 80 languages.

So that should likely cover the language you want to use. And all of them are, of course, free open-source models.
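The shape of that TTS-first pipeline can be sketched like this. Since the TTS engine is swappable, `synthesize_speech` below is a runnable placeholder (it writes a silent WAV standing in for real TTS output) and `build_ltx_inputs` is an illustrative helper, not any actual LTX or ComfyUI API:

```python
import wave

def synthesize_speech(text: str, out_path: str, seconds: float = 3.0, rate: int = 24000) -> str:
    """Placeholder for a real TTS call (Qwen-TTS, FishAudio, VibeVoice, ...).

    Writes silence so the pipeline is runnable end to end; swap this body
    for your preferred TTS engine's synthesis call.
    """
    with wave.open(out_path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(rate * seconds))
    return out_path

def build_ltx_inputs(line: str) -> dict:
    """Step 1: generate the audio first. Step 2: hand it to the video model
    as a custom-audio input alongside the prompt that repeats the line."""
    audio = synthesize_speech(line, "voice.wav")
    return {
        "prompt": f'The woman looks at the camera and talks, she says: "{line}"',
        "custom_audio": audio,  # given to LTX as the audio clip input
    }
```

The point of the ordering is that the spoken line exists as a fixed audio file before video generation starts, so the language of the line is entirely up to the TTS model.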

hey Rune, sorry, I'm lost here. Is this the workflow to add, like, music and make someone sing while "walking", for example? "LTX-2.3_-_V2V_Just_Talk_add_lipsynced-voice_to_any_video.json"

If so, where do I load the audio? I can only see where to add a video...

LTX-2.3-V2V_Just_Talk_add_lipsynced-voice_to_any_video.json is more meant to add voice to a silent video, let's say a video made with Wan for example. With this workflow it masks the face area and generates a voice from your prompt, where LTX "redoes" the face of the Wan video to make it lip-sync to the prompt you gave, so that the original Wan video gets audio. (In addition to the spoken words, LTX can/will also add other sound, like background ambience, sound FX, etc.)

For singing you would have to connect a "custom audio" input.
