audio and image to video

#15
by lanranjun - opened

When I use LTX to generate videos, the results don't quite match the reference images. I'd say they are 80% similar, and they change every time I run it. How can I improve this? Especially when I need to generate multiple segments to create a complete video with a storyline.

What workflow are you using? You could try the first/last-frame workflow to see if that improves consistency.
But yes, models do take a bit of freedom; it's not always a 100% look-alike. It should be fairly good, though, at least in my experience, if you are using a regular I2V workflow.

But if it's a celebrity, or a person you know, you are far more likely to spot micro-differences ;-)

LTX-2 - I2V Talking Avatar (voice clone Qwen-TTS).json
I use this workflow, and I want to make well-known movie stars sing designated songs. It is obvious that the appearance of the stars in the video is only similar, not identical.
I made the same video using Wan and LTX, and it's evident that Wan meets the requirements while LTX is merely similar.
Wan:
https://www.bilibili.com/video/BV1juFTz8Eed/?spm_id_from=333.1387.homepage.video_card.click&vd_source=fc45d39e8b76813ac6bb03d5e41c4de8
LTX:
https://www.bilibili.com/video/BV15L6bBTELZ/?spm_id_from=333.1387.homepage.video_card.click&vd_source=fc45d39e8b76813ac6bb03d5e41c4de8

Owner

Yeah, there was a slight difference there. I'll take a look at the talking-avatar workflow to see if anything's off. It could be that it's using a guider rather than the frame-injector node, but I'm pretty sure I used imageInPlace.
But I will check ;-) it might just be the LTX-2 model though.

My LTX model list:
ltx-2-19b-dev-fp8_transformer_only.safetensors
ltx-2-19b-distilled-lora_resized_dynamic_fro09_avg_rank_175_bf16.safetensors
ltx-2-19b-embeddings_connector_distill_bf16.safetensors
gemma_3_12B_it_fp8_e4m3fn.safetensors
LTX2_video_vae_bf16.safetensors
LTX2_audio_vae_bf16.safetensors
ltx-2-spatial-upscaler-x2-1.0.safetensors

The reason for using ltx-2-19b-embeddings_connector_distill_bf16.safetensors is Kijai: the comment in the project reads: Rename text_encoders/ltx-2-19b-embeddings_connector_bf16.safetensors to text_encoders/ltx-2-19b-embeddings_connector_distill_bf16.safetensors
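
For example, one way to do that rename (adjust the base path to your own ComfyUI models folder):

```python
import os

# Rename the embeddings connector to the filename the workflow expects,
# per Kijai's comment quoted above. Paths assume you are already inside
# the ComfyUI models directory.
os.rename(
    "text_encoders/ltx-2-19b-embeddings_connector_bf16.safetensors",
    "text_encoders/ltx-2-19b-embeddings_connector_distill_bf16.safetensors",
)
```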

Yes, your models all seem correct ;-) I checked the workflow and it's the first-frame injector (LTXVImgToVideoInplace) at strength 1.0 (check that you have full strength as well). The frame-injector node is the strongest first-frame node, so it should work pretty well. At first glance everything looks correct; maybe it's just the LTX-2 model taking a bit of "freedom".
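
If you want to double-check the strength on your side, a quick sketch (this assumes ComfyUI's UI-format workflow export with its nodes / type / widgets_values fields; adjust to your file):

```python
import json

# Print the widget values of the first-frame injector node so you can
# confirm the strength reads 1.0.
with open("LTX-2 - I2V Talking Avatar (voice clone Qwen-TTS).json") as f:
    workflow = json.load(f)

for node in workflow.get("nodes", []):
    if node.get("type") == "LTXVImgToVideoInplace":
        print(node.get("widgets_values"))
```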

But I'll try rendering some celebrities here too, where it's easier to see if some resemblance is lost ;-)

Yeah, I guess that's 80-90% Leonardo ;-)
I'll try some things here, to see if it's possible to inject more frames for even stronger guidance.

Is it possible to add a node that extracts face images from the reference images and then uses the magnified high-definition faces as a reference?

Not sure; I'll experiment a bit to see if anything makes it stronger. But it might just be how LTX-2 is (currently).
The LTX-2 team said they would soon release LTX-2.1, which should make I2V better and add portrait mode.

One thing that is possible, of course, is training a LoRA. But that takes a bit of time:
https://docs.ltx.video/open-source-model/usage-guides/lo-ra

The model ltx-2-19b-dev-fp8_transformer_only.safetensors is relatively good, while ltx-2-19b-dev-fp8.safetensors has a higher probability of exhibiting issues such as not responding or even generating static images.

I encountered another issue: when running the same prompt multiple times, sometimes the character opens its mouth to speak, and sometimes it doesn't. Is there a good solution to this problem?

It's a fairly common challenge; many have had trouble with that. I usually prompt explicitly that the person talks, something along the lines of: "... and then the woman talks, and the woman says: '...'".
Also, don't prompt as if you were generating a static image (describe less of what you already see; focus instead on sequential actions, i.e. what will happen). LTX-2 prompting is a bit different:
https://ltx.io/model/model-blog/prompting-guide-for-ltx-2
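
For example, an illustrative prompt in that style (not from the workflow, just the pattern):

```
A woman stands in a recording studio facing the camera. She begins to
sing, her lips moving clearly in sync with the music, and she sings:
"...". The camera slowly pushes in while she keeps singing.
```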

And this LoRA also seems to help: https://huggingface.co/MachineDelusions/LTX-2_Image2Video_Adapter_LoRa/tree/main

When using LTX to generate videos of real people, I found that setting the sigma values to 0.600, 0.580, 0.550, and 0.000 for the second sampling pass ensures a high degree of consistency with the reference image. For non-real people, the original sigmas can be kept.
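
That presumably works because starting the second pass at sigma 0.6 rather than a full-noise value re-noises the latent less, so more of the first pass (and hence the reference likeness) survives. A minimal sketch of the schedule itself, assuming your sampler node accepts a custom sigma tensor (node names vary by node pack):

```python
import torch

# Custom sigma schedule for the second (refine/upscale) sampling pass,
# using the values reported above. How you feed it to the sampler
# depends on your ComfyUI node pack; this only builds the tensor.
second_pass_sigmas = torch.tensor([0.600, 0.580, 0.550, 0.000])
```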

interesting. Will try that too ;-)

Note my LTX model list above: if those models are not used, especially the main model (ltx-2-19b-dev-fp8_transformer_only.safetensors), the output video quality may degrade (e.g. an oily, over-smoothed look).

After you finish testing, could you please give me a reply? I want to know if those parameters are effective on your side.

Did a quick test; seems OK, might be a little better, yes, and for sure not worse ;-)
That was only one test though; I'll see with a few other start images whether it's more noticeable.

Do you know how to disable subtitles?

They might appear randomly (although I never had them, I've seen other users get them sometimes). LTX themselves said that adding "subtitles" to the negative prompt could fix that.

I do wonder why I never get them though; maybe it's also something in the prompting style. But that's just speculation.

One challenge with the negative prompt is that it's ignored with the distilled model / distilled LoRA (cfg 1). So for it to have any impact you would need to add the LTX NAG node (or run full steps with the DEV model, but that's quite slow). Or just try a different seed; that's probably a quicker fix ;-)
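
For context, the reason cfg 1 drops the negative prompt falls out of the standard classifier-free guidance formula (a generic sketch, not LTX-specific code):

```python
# Classifier-free guidance blends the conditional (positive-prompt) and
# unconditional (negative-prompt) predictions:
#     out = uncond + cfg * (cond - uncond)
# At cfg = 1.0 this collapses to out = cond, so the negative-prompt
# branch cancels out entirely -- which is why NAG-style nodes apply the
# negative prompt through a different mechanism.
def cfg_mix(cond, uncond, cfg: float):
    return uncond + cfg * (cond - uncond)  # == cond when cfg == 1.0
```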

Yes, I am currently using the distilled model / distilled LoRA (cfg 1), so the negative prompt does not take effect. Does your workflow have an example using the LTX NAG node?

Additionally, I've noticed that videos produced using upscaling sampling are still not very clear. Have you encountered this issue? I'm not using the detail LoRA; could this be the problem?

Try a different seed and see if the subtitles are gone then; that's probably the quickest fix.
I haven't had any issues with clarity after upscaling. You could always try the detail LoRA (as far as I understood it, it's really meant for their restore-old-video workflow, but it might give more detail in regular workflows as well; I haven't tried that one).

The subtitle issue has been resolved. Previously, to save time, the audio sampling did not use an extracted pure vocal track, which led to subtitles appearing when paired with the music video. This time I used the extracted pure vocal track for sampling and then, in the final synthesis, replaced it with the complete audio. That solved the problem.
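
A minimal sketch of that preprocessing step, assuming a stem separator such as Demucs is installed (any vocal-isolation tool works; file names follow Demucs' defaults):

```python
import subprocess

# Separate the vocal stem from the full song before feeding it to the
# talking-avatar pipeline; the full mix is swapped back in at the end.
# --two-stems=vocals writes vocals.wav and no_vocals.wav under
# separated/<model>/<track>/.
subprocess.run(["demucs", "--two-stems=vocals", "song.mp3"], check=True)
```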

Has subtitles:
https://www.bilibili.com/video/BV1XscbzQE3r/?spm_id_from=333.1387.homepage.video_card.click&vd_source=fc45d39e8b76813ac6bb03d5e41c4de8

No subtitles:
https://www.bilibili.com/video/BV1NPcYzXEQx/?spm_id_from=333.1387.homepage.video_card.click&vd_source=fc45d39e8b76813ac6bb03d5e41c4de8

The video is in 720p, but the clarity is slightly worse when viewed on a computer. However, the clarity is acceptable on mobile devices.
