A video-to-video lip-sync model based on latent diffusion, with custom code modifications and post-training on 3,000 hours of video data. Here are some examples from the 512×512 version.