Workflow: V2V - Just Talk - Prompt lip-synced voice and sounds to any silent video

#52
by RuneXX - opened
Owner

Add voice and sounds to your silent videos with lip-sync.

It has a few settings to play around with, such as facemask vs. no facemask (how strictly to adhere to the input video), as well as how strongly the end of the video should influence the result. These settings determine how much freedom the model has to change things; too strict can look a bit unnatural.

Plus an extra feature: you can also extend your silent video, since most such clips (from Wan etc.) are probably short.

A little bit experimental, so there might be updates to the workflow... but something to play around with for now ;-)

Owner • edited Mar 22

With extended video (optional part of the workflow)

Is it possible to make it so that there are no changes except to the masked part?

Owner

Is it possible to make it so that there are no changes except to the masked part?

Should be possible with a bit of masking. The mask in the above workflow is kept a bit weak to ensure lip-sync, but with proper inpaint-like masking it should be doable ;-)
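(For anyone who wants to prepare such a mask outside the graph, a rough sketch of the usual inpaint-style prep: grow the mask a little, then feather the edge so the seam blends. File names are placeholders, and this is the general technique, not a node from the workflow.)

```python
import cv2
import numpy as np

# White = region allowed to change; placeholder file name.
mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)

# Grow the mask slightly, then feather the edge so the inpainted
# region blends into the untouched pixels instead of leaving a seam.
grown = cv2.dilate(mask, np.ones((9, 9), np.uint8), iterations=1)
feathered = cv2.GaussianBlur(grown, (31, 31), 0)

cv2.imwrite("mask_feathered.png", feathered)
```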

Nice! For the foley / sound generation (v2v), is there a way to simply connect the generated audio to the video combine node instead of creating a new video from the input one?

Owner
β€’
edited Mar 23

Is it possible to make it so that there are no changes except to the masked part?

A little inpainting test... seems to work. Will try to find a sweet spot for details etc.

Prompt: "blue eyes and glasses" ;-) with the mask around the eye area. Not 100% limited to the masked area, but close. (The timing is a little different in the example above, but that's my fault: one video was 24 fps, the other 25 fps.)
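(Side note on the fps mismatch, since it trips people up: the same frames simply play back faster at 25 fps, so anything synced to one rate drifts on the other. A quick sketch with made-up numbers:)

```python
# Hypothetical numbers: the same 120 frames interpreted at two frame rates.
frames = 120

duration_24 = frames / 24  # 5.0 seconds
duration_25 = frames / 25  # 4.8 seconds

# The clip "shrinks" by ~4%, so audio synced at one rate drifts at the other.
drift = duration_24 - duration_25
print(f"{drift:.2f}s drift over {duration_24:.1f}s ({drift / duration_24:.0%})")
```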

Owner • edited Mar 23

Nice! For the foley / sound generation (v2v), is there a way to simply connect the generated audio to the video combine node instead of creating a new video from the input one?

That's what the foley workflow already does. It does generate a video (since it's a video model), but the video part is discarded at the end; only the audio is used
(except if you also extend the video, in which case the newly added video parts are also from LTX).
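(And if you'd rather attach the generated audio to your original clip entirely outside ComfyUI, a minimal sketch that shells out to ffmpeg, copying the video stream untouched. File names are placeholders:)

```python
import subprocess

# Copy the original video stream as-is and add the generated audio track.
subprocess.run([
    "ffmpeg", "-y",
    "-i", "silent.mp4",   # original silent video (placeholder name)
    "-i", "foley.wav",    # audio generated by the foley workflow (placeholder)
    "-map", "0:v:0",      # take video from the first input
    "-map", "1:a:0",      # take audio from the second input
    "-c:v", "copy",       # no video re-encode
    "-c:a", "aac",
    "-shortest",          # stop at the shorter of the two streams
    "out.mp4",
], check=True)
```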

This workflow is almost exactly what I needed. I am testing it on my video; however, I noticed a lora loader for ANIMTEDDIFF\v3_sd15_adapter.ckpt. Is it required? Is it the model below?
https://huggingface.co/guoyww/animatediff/blob/main/v3_sd15_adapter.ckpt
Where should I put it in ComfyUI? Should I put it in the loras folder?

I am sorry, but where can I input the audio?
I can see the load video node, which I can replace with my video, but for audio it seems to use an empty audio latent rather than loading the voice audio. I am sure I am missing something; any help is greatly appreciated.

Thanks in advance.

I noticed a lora loader for ANIMTEDDIFF\v3_sd15_adapter.ckpt. Is it required? Is it the model below?

No, that's just a "placeholder" that ended up there by accident. I added a secondary lora loader in this workflow where the audio part is muted (for user-made loras not trained on audio data).
Nothing should be loaded in it unless you want to load some lora. I'll take a look at the workflow and see if I can set it to "off" by default.
For now, just select it and press Ctrl+B to bypass it.

I am sorry, but where can I input the audio?

In this particular workflow you just prompt for the audio you want to have.
Since I am already updating it for the lora loader (see post above), I might add an optional custom audio input as well ;-)

Since I am already updating it for the lora loader (see post above), I might add an optional custom audio input as well ;-)

This is going to be cool. I am looking forward to it.


PROMPT-GENERATED VOICE:

CUSTOM AUDIO - where you can use your own audio file as input (in the demo, a voice mp3 generated with Pocket TTS):


UPDATED WORKFLOWS

  1. A new version where you can use a custom audio file as the audio input to lip-sync to
  2. An updated prompt-to-speak version, removing a confusing lora loader

And the workflow seems to work even better with the v1.1 distilled model ;-) though I only did a few runs to create example videos.

(I'll update the Sam3 version too, where you just prompt what to mask, but I will wait for Sam3 to be natively supported in ComfyUI; it will be real soon.)
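(If your own audio file needs cleaning up before feeding it in, here's a rough torchaudio sketch for downmixing to mono and resampling. The 48 kHz target and file names are assumptions; check what the workflow's audio loader actually expects:)

```python
import torchaudio

TARGET_SR = 48_000  # assumed target rate; check the workflow's audio loader

waveform, sr = torchaudio.load("voice.mp3")    # placeholder file name
waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
if sr != TARGET_SR:
    waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
torchaudio.save("voice_prepared.wav", waveform, TARGET_SR)
```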

Would it be possible to incorporate the ID-LoRA into this workflow? That way character audio can be consistent across multiple scenes. I think that would probably make it the most useful LTX 2.3 workflow for me out of any in existence.

Would it be possible to incorporate the ID-LoRA into this workflow? That way character audio can be consistent across multiple scenes. I think that would probably make it the most useful LTX 2.3 workflow for me out of any in existence.

That might actually be a good idea ;-) Should be doable. Will give it a try.

Hello RuneXX, the workflow is awesome! But I've hit a problem. When I use the one-pass method, the video seems to degrade and shift a little bit, while maintaining face consistency best; when I use the two-pass method, the video doesn't degrade or shift anymore, but it's hard to maintain consistency well. So is there a way to make the one-pass video without degradation or shift?

How long are your extensions? LTX (or any video model, for that matter) will degrade if you go beyond its training; for LTX that is up to 20 seconds (and perhaps the sweet spot is 5-15 seconds).
That aside, if you use "only" one pass and bypass the 2nd sampler, you can set a higher step count.

I often use the Model Shift + Basic Scheduler node combo and set steps to around 10 or so. Often those extra steps iron out a few things, like "blurry teeth" etc.
I think you'll find those two under the "Manual Sigma" node, since I leave a few things hidden in the workflow for those who want to change something easily.
(I've been tempted to leave them as the default in the workflow many times, but decided to keep the defaults at LTX's recommended settings.)
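(For the curious, a sketch of roughly what the shift does to the schedule, using the common flow-matching shift formula. This is illustrative, not the node's exact internals, and the shift value here is a placeholder:)

```python
import numpy as np

def shifted_sigmas(steps: int, shift: float = 3.0) -> np.ndarray:
    """Evenly spaced sigmas warped by the common flow-matching shift:

        sigma' = shift * sigma / (1 + (shift - 1) * sigma)

    A higher shift spends more of the schedule at high noise. The shift
    value is illustrative; the actual node derives its own.
    """
    t = np.linspace(1.0, 0.0, steps + 1)
    return shift * t / (1 + (shift - 1) * t)

# 15 steps instead of the stock 8-step manual sigmas
print(shifted_sigmas(15))
```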

If they're there, they're basically already connected to the model, so simply connect the sigma output to the sampler. I'll check whether they're in the workflow; if not, I'll make a screenshot of how to add them.

Although I could be misunderstanding what you meant by "degrade and shift".

UPDATE:
I checked the workflow, and those alternative nodes are "hiding" under the manual sigma node.
Try using those instead for the sigma and set a few more steps, 10-15 ish.

You can also try the regular euler_ancestral sampler; it can be a little less "harsh" for a single pass. Alternatively, LCM.


And a little side note: this isn't an "upscaler" workflow, even though with 2 passes it might pass for one.
Ideally you should set the max input size node at the bottom of the workflow to at most the size of the input video (or smaller).
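(A sketch of the idea: pick a target size that never enlarges the source, snapped here to multiples of 32 since video models typically want dimensions divisible by their patch size. The exact constraint for LTX may differ, so treat the 32 as an assumption:)

```python
def clamp_size(src_w: int, src_h: int, max_side: int, multiple: int = 32):
    """Downscale-only target size, keeping aspect ratio.

    multiple=32 is an assumption about the model's size constraint.
    """
    scale = min(1.0, max_side / max(src_w, src_h))  # never enlarge
    w = int(src_w * scale) // multiple * multiple
    h = int(src_h * scale) // multiple * multiple
    return w, h

print(clamp_size(1280, 720, max_side=768))  # -> (768, 416)
```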

Ran a little single-pass test; seemed OK. But it might depend a lot on the input video (those I tried seemed fine).

Tried a very difficult one as well, just to see if something other than pretty girls would fail ;-)
Not the best lip-sync when it's a weird creature like that, but it still works OK, I guess ;-)

I often use the Model Shift + Basic Scheduler node combo and set steps to around 10 or so. [...]

Thank you, I will definitely try the Model Shift + Basic Scheduler node combo. Thank you for taking the time; I will let you know if it solves the problem.
What I meant by "degrade and shift" is that there is some subtle noise on the video (nearly invisible, but noticeable).
If you have time, please check my videos on Civitai.

The original Wan 2.2 video:
https://civitai.red/images/129537168

The lip-synced LTX 2.3 video:
https://civitai.red/images/129537370

Yes, the Model Shift + Basic Scheduler node combo with steps set to 15 greatly improves the quality. The noise is almost unnoticeable now! Thanks again, I may stick to this method! 😃

That looks great ;-)

the Model Shift + Basic Scheduler node combo with steps set to 15 greatly improves the quality.

And yes, it's my own personal favorite setting. But I don't put it as the default in the workflow, since so many pointed out that it was not how LTX themselves had it ;-)
So the default in the workflow is stock LTX settings (manual 8-step sigma, euler_cfg etc.).

But sometimes you gotta break the rules ;-) and 15 steps can do miracles when you need to squeeze a bit more out of the model.

That being said, the "wobbly colors" you had with manual sigma were a bit unusual. Might be something wrong somewhere.

Edit:
Now that you've pointed it out, I actually see it in my own:
https://huggingface.co/RuneXX/LTX-2.3-Workflows/discussions/52#69f934a439f4d4accd602ef3 (this one is with stock LTX settings, no extra steps; a tiny bit of color-shifting wobble).
I'll check the workflow for anything off... (I was meaning to update the workflow anyway, since ComfyUI now supports Sam3 masking natively.)

Meanwhile 15 steps seemed to work great ;-)


The lip-synced LTX 2.3 video:
https://civitai.red/images/129537370

Looks great ;-)
I just wanted to ask if you are using some user-made lora in that one?
While it looks great, there is a tiny bit of noise in the audio. Not a lot, so don't worry about the post, it looks great ;-) I was just curious if you had some lora.

(Also make sure you are using the v1.1 version of LTX if using the distilled model, and likewise the v1.1 lora if using Kijai's. v1.1 has improved audio.)

This noise is usually from user-made loras not trained on audio. The fix is to load such loras in KJNodes' "LTX Advanced Lora Loader" and mute all the audio parts.
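(Conceptually, "muting the audio parts" just means dropping the lora weights that target audio blocks, which the KJNodes loader does for you. A rough sketch of the same idea with safetensors; the "audio" substring is a guess at the key naming, so inspect your lora's keys first:)

```python
from safetensors.torch import load_file, save_file

lora = load_file("my_lora.safetensors")  # placeholder file name

# Keep only weights that do not touch audio-related blocks.
# The "audio" substring is an assumption; check your lora's actual key names.
filtered = {k: v for k, v in lora.items() if "audio" not in k.lower()}

save_file(filtered, "my_lora_no_audio.safetensors")
```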

I only used the distilled lora "ltx-2.3-22b-distilled-lora-dynamic_fro09_avg_rank_105_bf16" with the checkpoint "LTX2.3 10Eros". But I also tried the checkpoint "ltx-2.3-22b-distilled_transformer_only_bf16" without any other lora, and the noise still shows up. But I think it's acceptable now, thanks to the Basic Scheduler method. :D

Also, I found the following workflow very useful for cleaning up the remaining flaws in the images. Just change all the stuff to LTX 2.3, then import the video, and you will get an even better result!
https://github.com/Lightricks/ComfyUI-LTXVideo/blob/master/example_workflows/2.0/LTX-2_V2V_Detailer.json
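(If you want to script the "change all the stuff to LTX 2.3" part rather than clicking through the graph, exported ComfyUI workflows are plain JSON. A rough sketch with placeholder model names; the exact JSON layout can vary between export formats, so check your file:)

```python
import json

# Placeholder names: map old model files referenced in the workflow to new ones.
SWAPS = {
    "ltx-2-old-model.safetensors": "ltx-2.3-new-model.safetensors",
}

with open("LTX-2_V2V_Detailer.json") as f:
    wf = json.load(f)

# The exported UI format keeps a "nodes" list, each with widget values.
for node in wf.get("nodes", []):
    values = node.get("widgets_values") or []
    node["widgets_values"] = [
        SWAPS.get(v, v) if isinstance(v, str) else v for v in values
    ]

with open("LTX-2_V2V_Detailer_ltx23.json", "w") as f:
    json.dump(wf, f, indent=2)
```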

Yes, it was not a lot of noise. All good ;-)

For the Eros model you should use the fp8 transformer version with the workflows here (not the "learned" checkpoint version, unless you change the model loaders to a checkpoint loader etc.).
But no big deal, the noise was very minimal ;-) Nothing to lose sleep over, hehe.
