Lipsync issues for vertical videos

#10
by pashka00 - opened

Hello!
Thank you for sharing the workflow — overall, it works very well.

I’ve encountered an issue with lip sync when using vertical videos. When the input images are in landscape orientation (width greater than height), lip synchronization with the audio works perfectly. However, for vertical images (height greater than width), lip sync does not work: mouth movements are either missing or not synchronized with the audio.
I am attaching a video example where lip sync does not work for a vertical input.

The only case where I managed to get lip sync working with vertical video was by applying CameraControl, as described in this discussion:
https://huggingface.co/RuneXX/LTX-2-Workflows/discussions/5
However, when using this approach, a noticeable grid / checkerboard artifact appears (see the attached video). This artifact seems to be related to the distillations and optimizations in LTX-2, which negatively affect generation quality.

I also tried using models with fewer optimizations, both with and without CameraControl. Unfortunately, lip sync does not work for vertical videos with any of them.

Could you please clarify what might be causing lip sync to fail specifically for vertical videos and whether there is a recommended solution or workflow that preserves both lip sync accuracy and visual quality?

Thank you in advance for your help and for your work on this project!

There shouldn't be any difference between vertical and horizontal... it's exactly the same workflow.
But I will take a look.

One thing that comes to mind is prompting. Since LTX-2 can put a narrator voice over any video, you have to explicitly say in your prompt that the person in the video is talking.
Start the prompt with: 'A man in a business suit is looking at the camera, he talks and says: "What specific questions...."'
(And prompt less about what you see; it's not an image generation model, so prompt like a film director: describe the actions, the dialog, what is to happen, and in what sequence.)
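To make this pattern easy to reuse, here is a small sketch of a prompt-builder helper; the function name and template wording are my own illustration, not part of the workflow:

```python
def build_talking_prompt(subject: str, dialog: str, background: str = "") -> str:
    # Explicitly state that the on-screen person is speaking,
    # so the model does not default to a narrator voice-over.
    prompt = (
        f"{subject} is looking at the camera, "
        f'he talks and says: "{dialog}" '
        "in perfect lip-sync to the attached audio."
    )
    if background:
        prompt += " " + background
    return prompt

print(build_talking_prompt(
    "A man in a business suit",
    "What specific questions...",
    "In the background people in the office are doing their work.",
))
```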

Probably even more so when it's a female voice, since the first thing the model might assume is that you want a narrator voice ... Try that ;-)
I'll also look at the workflow to see if there is anything.

I did struggle a bit here too, but with a random seed and prompting it worked.
Try setting the noise to randomize in "Sampler - first pass"; otherwise you will pretty much get the same result each run (prompting changes the result a little, but randomizing the seed makes each run genuinely different).
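If you queue the workflow from a script instead of the UI, the same effect can be had by stamping a fresh seed into the workflow JSON before each run. This is only a sketch: it assumes the workflow was exported in ComfyUI's API format (where nodes carry a `_meta.title` entry), and the node ID and seed field name here are hypothetical placeholders; your sampler node may use `seed` or `noise_seed`:

```python
import random

def randomize_seed(workflow: dict, title: str = "Sampler - first pass") -> dict:
    # Walk the API-format workflow and set a fresh random seed on every
    # node whose title matches, whichever seed input name it uses.
    for node in workflow.values():
        if node.get("_meta", {}).get("title") == title:
            for key in ("seed", "noise_seed"):
                if key in node.get("inputs", {}):
                    node["inputs"][key] = random.randint(0, 2**63 - 1)
    return workflow

# Hypothetical minimal workflow fragment for illustration:
wf = {
    "12": {"_meta": {"title": "Sampler - first pass"},
           "inputs": {"noise_seed": 42}},
}
randomize_seed(wf)
print(wf["12"]["inputs"]["noise_seed"])
```

Calling this once per run before submitting the JSON to ComfyUI gives each generation a different seed, which is what makes the runs diverge.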


My prompt:
"A cinematic video of a man in a business suit sitting down with a newspaper. He talks and speaks directly towards the camera.
The man in the business suit speaks in a light female voice and says "what specific questions.." .... in perfect lip-sync to the attached audio.
In the background people in the office are doing their work."

That being said, with a male voice input the model struggled far less ;-) so that might be part of it.
But yes, I have had images that are challenging, while other image inputs work easily - so even the input image could be of a nature that "invites" a narrative voice.

Also, the lip-sync movements seem to work better with a matching male voice.

Prompt:
"A cinematic video of a man in a business suit sitting down with a newspaper. He talks and speaks directly towards the camera.
The man speaks in a deep voice and says "ezekiel 24 17 .." ... with the rest of the dialog in perfect lip-sync to the audio.
In the background people in the office are doing their work"
