Weird audio issue with "extend any video"

#37
by whatsthisaithing - opened

The "extend" workflow seemed to be working perfectly: action continues exactly as expected, decent adherence.

But on the first few attempts, any dialogue I prompted for was ignored, though there was at least still some audio/background noise that fit okay. I simplified my prompt a little to just focus on the additional dialogue, and now I'm getting metallic noise instead of anything useful on the extended segment. Very odd. The same loras/setup work perfectly fine with T2V/I2V, so it's not that directly.

I ran the model output from the feed forward lora (which is disabled, but no issue there) into a new power lora loader (no issue with that node otherwise), then ran the output from there into the Set Model node. Seems pretty straightforward.

Is there maybe a config difference for the audio on the extended workflow or something? Probably missing something dumb on my end.

Will take a look, I did notice it could benefit from a little more "overlapping frames" at the extend part.

My best guess is that you are using a user-made lora that is not trained on audio. Unfortunately, many/most user-made loras ruin the sound.
There are some lora loaders that can mute the audio part of a lora. I'll see if I can find one that works well and is easy to use. I guess these user loras are quite common.

Other than that, perhaps double-check the audio VAE: it should be the LTX-2.3 VAE, not LTX-2.0.

But will take a look at the workflow, see if there is anything ..

Well, same loras (and same lora loader) have no issues producing great sound. Don't know if they're actually TRAINED on sound, but since 2.3 dropped, all of the loras I use work great audio-wise. Will poke and see if I can spot a dumb mistake on my part.

Just to add: the extended clip doesn't even try to say the lines (it's not just the audio that fails). It's like any additional dialogue is just straight ignored. I did disable the prompt enhancer. Maybe something extra got tweaked there?

not an audio issue but: why are you hardcoding the height and width instead of scaling from the input video? this causes issues when the ratio is off. and changing the resolution always leads to a mismatch:

File "\ComfyUI_portable\ComfyUI\custom_nodes\ComfyUI-KJNodes\nodes\image_nodes.py", line 1954, in imagesfrombatch
raise ValueError(f"Source and new images must have the same shape: {source_images.shape[1:3]} vs {new_images.shape[1:3]}")
ValueError: Source and new images must have the same shape: torch.Size([1088, 928]) vs torch.Size([1088, 896])

yes good idea. Will add that so it fits better with the input video size

i'm getting the issue now even when leaving it at 1280x720. def needs to be dynamic based around the actual video 👍
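
A minimal sketch of deriving the size from the input video instead of hardcoding it. The helper name `fit_resolution`, the 1280-pixel cap on the longer side, and the snapping to a multiple of 32 are assumptions for illustration, not values taken from the workflow; check what the model and nodes actually require.

```python
def fit_resolution(src_width, src_height, max_side=1280, multiple=32):
    """Return (width, height) that keeps the source aspect ratio,
    caps the longer side at max_side, and snaps both sides to `multiple`.

    max_side and multiple are assumed defaults, not workflow values.
    """
    if src_width <= 0 or src_height <= 0:
        raise ValueError("source dimensions must be positive")

    # Only scale down; a clip that already fits keeps its size.
    scale = min(1.0, max_side / max(src_width, src_height))

    def snap(value):
        # Round the scaled side to the nearest multiple, at least one multiple.
        return max(multiple, int(round(value * scale / multiple)) * multiple)

    return snap(src_width), snap(src_height)


# The source in the traceback above is 928 wide and 1088 high.
width, height = fit_resolution(928, 1088)
print(width, height)
```

If the source frames and the newly generated frames are both resized to the same computed size, the shape check in the traceback should no longer trip, though that depends on where the resize happens in the workflow.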

Will take a look. It was a bit of a "proof-of-concept" rushed a little bit since someone asked for it in a thread, perhaps something got overlooked ;-)

Plus add some quality improvements to it

Just to add: the extended clip doesn't even try to say the lines It's like any additional dialogue is just straight ignored

This is sometimes a bit tricky with LTX, but it usually comes down to prompting. You often end up with a narrator voice, but if you carefully tell LTX who is talking, it's less likely to use a narrator.

For example:
Then the man at the left with blue shirt talks with a deep rugged voice, and he says: "Can you hear me now".
With perfect lip-sync to the spoken words

As you can see, it's explicitly stated who is talking, and the dialogue is within quotation marks "...".
And be elaborate. You can even add what kind of accent he has, etc.

You can also try time stamps

08-12 seconds: The man talks, and he says: "..."
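
If you build prompts in a script, the pattern above (name the speaker, describe the voice, quote the line, optionally prefix a time range) can be sketched as a small helper. The function name `dialogue_prompt` and its parameters are hypothetical, and this only formats text; it makes no claim about how LTX parses it.

```python
def dialogue_prompt(speaker, voice, line, start=None, end=None):
    """Build one dialogue sentence that names the speaker and quotes the line.

    start/end are optional whole seconds for a time-stamp prefix.
    """
    sentence = f'{speaker} talks with {voice}, and says: "{line}".'
    if start is not None and end is not None:
        # Mirror the "08-12 seconds:" form with zero-padded seconds.
        sentence = f"{start:02d}-{end:02d} seconds: {sentence}"
    return sentence


print(dialogue_prompt(
    "The man at the left with blue shirt",
    "a deep rugged voice",
    "Can you hear me now",
    start=8,
    end=12,
))
```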

I fixed the sound issue this way:
[Screenshots: 2026-03-17-14-18-10, 2026-03-17-14-30-39]

Yes i noticed that too, i think some error snuck in, checking

Fixed ;-) was a little bug with the routing of the guiders. Sorry about that.
Uploaded new version

The new version also has several improvements and should work much better

LTX also updated their upscale model, so that should help improve things too (especially with regard to sudden text on screen etc)
https://huggingface.co/Lightricks/LTX-2.3/tree/main
Get the model: ltx-2.3-spatial-upscaler-x2-1.1.safetensors

Yep, that fixed it. Seamless blend of the original + extension, audio works great. Excellent work!
