How can I deploy a high-definition upscaling model as its own workflow while keeping the output consistent with the original input, both image and audio?
Your workflows are excellent. While deploying them I found the HD upscaling capability to be very strong, so now I want to create a separate workflow specifically for video HD upscaling. However, I am currently running into problems, mainly:
- My input contains both audio and video, but the original audio is lost in the output. The new audio is not the original dialogue; it sounds like it was generated separately.
- The consistency of the generated scenes and characters is also compromised.
Do you know how to solve these problems?
Happy you like it ;-)
For the first one, the audio part: if you have your own audio input, connect it to the output video node (and ignore the LTX audio). This is how many other workflows do it, too.
Let LTX do the visuals only, since the LTX audio is something "re-imagined" (and probably not the audio you want).
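If you'd rather handle the audio outside ComfyUI, you can also mux the original audio track back onto the upscaled (video-only) output with ffmpeg afterwards. A minimal sketch, assuming ffmpeg is installed; the filenames (`upscaled.mp4`, `input.mp4`, `final.mp4`) are placeholders:

```python
import subprocess


def mux_original_audio(upscaled_video: str, original_video: str, out_path: str) -> list[str]:
    """Build an ffmpeg command that takes the video stream from the
    upscaled file and the audio stream from the original input,
    stream-copying both (no re-encode, so no quality loss)."""
    return [
        "ffmpeg", "-y",
        "-i", upscaled_video,   # input 0: upscaled, video-only
        "-i", original_video,   # input 1: original, carries the audio
        "-map", "0:v:0",        # video from input 0
        "-map", "1:a:0",        # audio from input 1
        "-c", "copy",           # copy streams as-is, no re-encoding
        "-shortest",            # stop when the shorter stream ends
        out_path,
    ]


cmd = mux_original_audio("upscaled.mp4", "input.mp4", "final.mp4")
# To actually run it: subprocess.run(cmd, check=True)
print(" ".join(cmd))
```

Since `-c copy` only remuxes, this is fast and lossless, but note it cannot fix lip-sync drift if the upscaled video's timing differs from the original.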
For the second part, I think that's perhaps one of the weaknesses of LTX. The model can do many things (audio, video, controlnet, upscale, masking, etc.).
But consistency over time is not its strongest point. And any upscale adds new details, "re-doing" your video, so some changes are unavoidable.
(If you just want to upscale video without any "re-imagining" or "creativity" added, something like SeedVR2 or FlashVSR is probably more suitable; they are closer to true 1:1 upscalers, and both are available as ComfyUI nodes.)
What you can do, though, is add some guider nodes (LTX Guider). Set the first-frame guide to the first frame of your input video, and the last-frame guide to the last frame of your input video.
Or, simpler: an LTX image-in-place (LTXimginplace) node with the first frame as the reference image.
There is also a "new" node that I find works very well for consistency (at least new to me; I've started using it more and more), called LTX Add Latent Guide.
Set its index to -1 and encode your reference image (for example, the first frame of your video) as the guiding latent input, so the model has a latent guide to work towards.
As with any guider node, add an LTX Crop Guides node at the very end of your workflow, before the VAE decode, so that the guider "noise" is removed.
And yes, it's perhaps a little bit complicated. If you tell me exactly what you want the workflow to be and what it should do, I can always try to make one if you are stuck ;-)
Or, if you upload your workflow to Pastebin or similar, I can take a look at it.
What I want to achieve is to use the ltx2.3 upscale model to double the resolution of a video that includes audio. After reading what you wrote, I realized my initial approach was too simplistic.
I have also tried connecting the original audio to the output video node, but in that case the newly generated video and audio have lip-sync errors. However, I've already solved the issue of chaotic voice generation: I used an audio-to-text custom node (AV-FunASR) to convert the dialogue into text prompts to control the generation, and so far the results seem okay.
I will try the other methods you mentioned, thank you for your reply~
