Workflow - Custom audio (lipsync, sing-along etc)

#35
by RuneXX - opened

A workflow for those who want to try something with their own custom audio ;-)

I2V and T2V - Custom Audio: https://huggingface.co/RuneXX/LTX-2-Workflows/blob/main/LTX-2%20-%20I2V%20and%20T2V%20Basic%20(Custom%20Audio).json

Examples using Text to Video (T2V)

The T2V workflow link got truncated at the .json part 😅

> The T2V workflow link got truncated at the .json part 😅

ah yes, thanks for the heads up ;-) updated it

RuneXX changed discussion title from Workflow for custom audio (lipsync, sing-along etc) to Workflow - Custom audio (lipsync, sing-along etc)

Thank you for your clean workflows!

I have an issue with I2V: not every image wants to generate lipsync. Some work great, and some only produce strange zoom-in videos with expressions. I tried loras for a stable camera but the problem persists. Darker images also periodically show strange contrast shifts, like the exposure is stabilizing or something… I don't know whether the problem is in the models, their implementation in ComfyUI, or the workflow itself.
Also, when trying 1080p the upscaler is messing up the colors for some reason (it looks like bad color matching or something). Do I need to change something besides the width and height?

Why the model sometimes talks and other times uses a narrator voice seems to be down to the model itself.
What I found is that if you explicitly prompt what you want to happen in sequence order, including the dialog, it seems to work better. Things like "then the woman looks towards the camera and she talks, and says "......""
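The "explicit sequence" prompting style described above can be sketched as a tiny helper. This is purely illustrative: the function name and exact wording are made up, only the structure (action first, then an explicit cue to talk, then the quoted dialog) reflects the advice.

```python
# Hypothetical helper showing the "sequence order" prompt structure.
# The template wording is an example, not the workflow's actual prompt.
def build_prompt(subject: str, action: str, dialog: str) -> str:
    return (
        f"{subject} {action}, then {subject} looks towards the camera "
        f'and talks, and says "{dialog}"'
    )

prompt = build_prompt(
    "the woman",
    "stands by the window",
    "welcome back, I missed you",
)
print(prompt)
```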

For the upscale, did you perhaps run the Dev model with more steps / higher cfg in part 1 (without the distilled lora), but not use the distilled lora in the upscaler part either? It needs the lora in step 2, since the upscaler runs 8 steps at cfg 1.
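The rule above (a low-step, cfg-1 second pass only works with the distilled lora loaded) can be expressed as a small sanity check. All field names here are invented for illustration; this is not part of the actual workflow JSON.

```python
# Hypothetical sanity check for a two-pass setup. The dict keys
# ("steps", "cfg", "distilled_lora") are made-up illustrations of
# the sampler settings discussed above.
def check_passes(pass1: dict, pass2: dict) -> list:
    warnings = []
    if pass2["cfg"] <= 1 and not pass2.get("distilled_lora"):
        warnings.append("pass 2 runs at cfg 1 but has no distilled lora")
    if pass1.get("distilled_lora") and pass1["steps"] > 12:
        warnings.append("pass 1 uses the distilled lora but many steps; 8 is typical")
    return warnings

# Dev model without the distilled lora in pass 1 -> the lora is required in pass 2
setup = {
    "pass1": {"model": "dev", "steps": 20, "cfg": 4, "distilled_lora": False},
    "pass2": {"model": "dev", "steps": 8, "cfg": 1, "distilled_lora": False},
}
print(check_passes(setup["pass1"], setup["pass2"]))
```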

I will try prompting the dialogue from the audio - I wonder why it happens in some scenarios. And it's so strange with specific photos: no matter what I do they don't work great, while with others it works every time. Like it depends on the picture?

With the upscaling I tried many combinations - I also tried Kijai's models, the original models, etc.:

  • dev and distilled models
  • with loras: distilled, static camera, detailer, etc.
  • the distilled lora in both steps, only in the first pass, and only in the second pass - every combination.
  • your workflow as-is, without any changes besides setting the resolution to 1080p, and the upscaled version was still messed up. Only 720p works for me.

I will test more today.

Yes, I've seen the same: some photos are trickier than others.
I usually do 720p; I will try 1080p and see if something is off there.

For your difficult photos that won't talk, you could try this lora:
https://huggingface.co/MachineDelusions/LTX-2_Image2Video_Adapter_LoRa

It's just a theory, but it seemed to help the rare time I had a challenging image, and the lora also claims to fix/address this issue.
Would be interesting if it worked for you as well ;-)

(The lora also makes the K-Sampler work, but I found that less interesting than making "difficult" images move correctly. I might make a simple one-pass workflow with the K-Sampler for the 'nostalgic' ones who want it as Wan-like as possible ;-))

> Also, when trying 1080p the upscaler is messing up the colors for some reason (it looks like bad color matching or something). Do I need to change something besides the width and height?

Did a 1080p test run with a colorful image. Seems to work fine for me, at least.
A little unsure then why yours is not. Perhaps double-check the models, and if you are using Dev without the distilled lora in step one, add the distilled lora in step 2.
You can also lower the strength a bit on the first-frame input in the 2nd-pass upscaler group. For example, set that node to 0.8 or so (depending on the reference image, that might help).

I updated the workflow with a lora loader node at step 2 (for cases where you run the first-pass sampler at 20+ steps and cfg 4 with the Dev model and no distilled lora).
Just to make that part clearer (I also added a few things like a power lora loader, and put I2V and T2V in the same workflow to make it more flexible and cut down a little on the number of workflows).

I2V and T2V : https://huggingface.co/RuneXX/LTX-2-Workflows/blob/main/LTX-2%20-%20I2V%20and%20T2V%20Basic%20(Custom%20Audio).json

Thank you for your updated workflow and your patience!
I tried it and still no luck, but I found some new things. The only thing I changed is the models: your workflow uses GGUF, so I switched to the normal models.
I rendered with all your settings unchanged and it turns out quite OK... but I saw some slight discoloration on reds. Up to 9 seconds the results are the same as at 5 seconds, but the longer the video and the more movement, the more visible the discoloration.
Then I tried rendering a 15-second video, and this is where things happen: all the colors are messed up. So I think the upscaler / 2nd pass doesn't handle the color matching well for me. Only the 1st pass gets it right; the 2nd pass is messing up the colors...

Will try some longer videos and see if I can get the same ;-)
Whether there is an error there, or it's the model reacting to a tinted video input, or something else.

So strange, I can't reproduce that at all. Everything looks exactly the same after 15 seconds...
This is a very unnatural and tinted input image, so I thought that might be it... but it seems OK on my end, even with that extreme tint.

So odd: your 9-second video is fine, but in the 15-second one she has green teeth after 2 seconds - it starts to degrade right away. Yet it worked fine on my end.
I don't know what is different between your 9- and 15-second runs. Whether it's something in the setup, the models loaded, the loras, or... hard to guess.

But I will try other models and see if I can reproduce it somehow... maybe the fp4 one is "weaker" or something.

Yes, that is true, but you can see discoloration on her hands, pants, and belly. I have this at 720p as well (like in your example), but at 1080p it stays quite similar only up to 9-10 seconds (and not as close as at 720p); after that, disco colors. I wonder why it happens. I rendered in 1080p with only the 1st pass, no upscaling, and it looks as it should - no discoloration.
I redownloaded the models but still get the same results - with both Kijai's and the original models.
I wonder if some nodes are causing this, or some versions of env libraries? Or maybe the color profile of the image? Hopefully it's not the upscaling model itself...

edit:
I managed to stabilize the "disco" colors on the 15-second 1080p video with the detailer lora, so I now get results like yours at 1080p, but after the 2nd pass the colors on the hands, legs, and belly still shift to more greyish/yellowish.

Could be that some improvements could be made to the nodes, or even an error - it's been so many workflows. Will debug a bit.
Out of curiosity, what happens if you use a "normal" color image instead of the heavily tinted one? Or maybe you did that on purpose, to better show what you meant.

But I will try some here too and see if there is something to be done or changed ;-) For example, you could try ref_image instead of compressed_image at the upscale node, etc.

And I'm also curious whether it happens only with the custom audio workflow (since this thread is about that) or in any and all workflows.

In case you want to try a more "normal" image, but maybe the whole tint thing was on purpose:
(image attached)

It's an image generated in Nano Banana and I didn't color grade it - I think it looks quite natural, just with warmer tones. But if I use a more natural image it still has the issue, just in different areas or with different color mismatches. I think the colors here don't matter.
I checked with your color-edited version and it still has issues :D

ok will try some here too, see if i can figure it out ;-)

(maybe your girl secretly wants to be Gamora haha)

If only this one image were bugged - but no, every image I tried T-T

Just to rule out any workflow errors, I ran the same image in the stock default workflow from ComfyUI, unchanged, with the stock models.
And the result is the same.

So it's a bit more mysterious then. Whether it's something with the model itself, or the vae, or the upscaler.
Or whether you need to do something different when rendering 1080p videos longer than 10 seconds (but the model itself handles 4K at 50 fps, so 1080p should be child's play).

But I will see if something was overlooked, or if a different sampler works better, etc.

(video frame attached)

Maybe external factors, like a buggy driver? 🤔

Btw, does a similar issue also happen on Wan?

> Maybe external factors, like a buggy driver? 🤔

Could be all sorts of things - I saw some post saying SageAttention gave Z-Image bad results. But I'll see if I can find something that works nicely on long 1080p.
(I did some videos like that before with no issues, oddly enough... will check if anything was different.)

I do see there is a hardcoded node in all the workflows, "Resize Images by Longer Edge" (mine included, but also ComfyUI's, and LTX's own, set to 1536px). Some suggested setting this to the longest edge of the base image. Not sure if that matters, but worth a try I guess ;-)
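What a "Resize Images by Longer Edge" node does can be sketched in a few lines: scale the image so its longer side equals the target (1536 px in the stock workflows), preserving aspect ratio. The rounding behavior below is an assumption, not the node's actual implementation.

```python
# Sketch of longer-edge resizing. 1536 is the hardcoded default
# mentioned above; rounding to the nearest pixel is an assumption.
def resize_by_longer_edge(width: int, height: int, target: int = 1536):
    longer = max(width, height)
    scale = target / longer
    return round(width * scale), round(height * scale)

print(resize_by_longer_edge(1080, 1920))        # portrait input, default target -> (864, 1536)
print(resize_by_longer_edge(1080, 1920, 1920))  # target = longest edge of the base image
```

Setting `target` to the base image's own longest edge, as suggested above, makes the node a no-op for that image.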

> (I did some videos like that before with no issues, oddly enough... will check if anything was different.)

So I never had this issue before because I didn't do portrait mode at 1080p.
It seems like in 16:9 mode everything is fine. Did a few runs, no issues.

(I forced it to 20 seconds to do an extra-long test as well, even though my audio is only 14 seconds... that's why she silently keeps talking ;-))

It happens in the default ComfyUI workflows too, so I guess it's something with vertical mode only.

> Maybe external factors, like a buggy driver? 🤔
>
> Btw, does a similar issue also happen on Wan?

Wan doesn't have its own upscaler, so I never had the issue. On Wan I upscaled videos with a different upscaler and never had an issue like this.

I will also try some real photos maybe, but I feel like it's not caused by the image - I think it's just random.
I tried different "Resize Images by Longer Edge" sizes with no luck; I even disabled it and still got the same results XD

Hmm, that is interesting. I will try the horizontal option. At least now we know there is some kind of issue. Maybe it will be fixed - I saw that the LTXV2 team will upload a multi-modal guider and a new VAE, so that may fix some things.

Yes, I saw a thread on their own repo with users having problems with verticals.
Maybe some fix will come, unless the model's training data lacks HQ verticals. But it could be as simple as the vae.

For now, you can simply add a color match node from KJNodes. While not a perfect solution, it should at least help a bit (and if it's too strong, it has a strength setting).

(image attached)
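Roughly, a color match node transfers the reference image's per-channel statistics onto the output, blended by a strength setting. Below is a minimal mean/std transfer sketch of that idea; it is not the actual KJNodes implementation, just an illustration of what such a node does.

```python
import numpy as np

# Minimal sketch of per-channel mean/std color matching with a
# strength blend, illustrating (not reproducing) a color match node.
def color_match(image: np.ndarray, reference: np.ndarray, strength: float = 1.0) -> np.ndarray:
    matched = image.astype(np.float64).copy()
    ref = reference.astype(np.float64)
    for c in range(3):  # transfer mean and std per RGB channel
        mean_i, std_i = matched[..., c].mean(), matched[..., c].std() + 1e-8
        mean_r, std_r = ref[..., c].mean(), ref[..., c].std()
        matched[..., c] = (matched[..., c] - mean_i) / std_i * std_r + mean_r
    # strength blends between the original (0.0) and fully matched (1.0)
    return (1 - strength) * image + strength * matched
```

At full strength the output's channel means equal the reference's; lowering the strength keeps more of the original colors, which is why a too-strong match can be dialed back.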

I tried color match from KJ on this problem without much luck 😄
I tested horizontal 1080p and it worked great - no problems with colors. So maybe yes, it's restricted to certain resolutions? I also wanted to try a 2K horizontal render... and it turns out that is also messed up 🥳. So 720p works great, 1080p horizontal is OK, but above that the upscaler messes up the colors 😆. I wonder if 4K would work, but I don't think I can render 15 seconds in 4K.

An option is also just rendering it in one pass - full size, no upscale.

Or wait for LTX-2.1. I just saw this now: LTX is fixing portrait mode ;-)

(image attached)

Tried it - it comes out as black frames.
Great! So good they are fixing this problem ^^

> Tried it - it comes out as black frames.

I guess that goes to show. With 1080p (long length) I get black frames too.
LTX-2.1 it is ;-)
