@@ -1,34 +1,60 @@
-<h1 align='center'>HighSync: High-Quality Lip Synchronization via
-Latent Diffusion Models</h1>
-<div align='center'>
-    <a href='https://github.com/saeed5959' target='_blank'>Saeed Firouzi</a><sup>1</sup>&emsp;
-</div>
-<br>
-<div align='center'>
-    <a href='https://github.com/saeed5959/high_sync'><img src='https://img.shields.io/badge/github-8da0cb?style=for-the-badge&labelColor=555555&logo=github'></a>
-    <a href='https://arxiv.org/abs/2605.16918'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>
-    <a href='https://huggingface.co/datasets/saeed-5959/vfhq'><img src='https://img.shields.io/badge/Dataset-Hugging_Face-CFAFD4'></a>
-</div>
-## Abstraction
-We present HighSync, an end-to-end diffusion-based
-framework for high-fidelity lip synchronization that generates
-photorealistic talking-face videos aligned with arbitrary input
-audio. Existing approaches consistently struggle to reconcile
-image quality with synchronization accuracy, producing either
-visually degraded outputs or temporally inconsistent lip move-
-ments. HighSync addresses both challenges simultaneously and,
-to our knowledge, is the first lip sync model to operate natively
-at 512×512 resolution, positioning it as a viable solution for
-professional production environments such as the film and broad-
-cast industries. Central to our approach is the identification and
-systematic elimination of a data leakage phenomenon that has
-silently undermined temporal modeling in prior work, preventing
-models from developing a genuine dependence on the audio
-signal. Comprehensive evaluations across both perceptual quality
-and synchronization accuracy metrics confirm that HighSync
-achieves state-of-the-art performance on both fronts.

+---
+pipeline_tag: image-to-video
+library_name: diffusers
+---
+<h1 align='center'>HighSync: High-Quality Lip Synchronization via Latent Diffusion Models</h1>
+HighSync is an end-to-end diffusion-based framework for high-fidelity lip synchronization that generates photorealistic talking-face videos aligned with arbitrary input audio. To the authors' knowledge, it is the first lip-sync model to operate natively at 512x512 resolution, which positions it as a viable option for professional production environments.
+- **Paper:** [HighSync: High-Quality Lip Synchronization via Latent Diffusion Models](https://huggingface.co/papers/2605.16918)
+- **GitHub:** [saeed5959/high_sync](https://github.com/saeed5959/high_sync)
+## Abstract
+We present HighSync, an end-to-end diffusion-based framework for high-fidelity lip synchronization that generates photorealistic talking-face videos aligned with arbitrary input audio. Existing approaches consistently struggle to reconcile image quality with synchronization accuracy, producing either visually degraded outputs or temporally inconsistent lip movements. HighSync addresses both challenges simultaneously and, to our knowledge, is the first lip sync model to operate natively at 512x512 resolution. Central to our approach is the identification and systematic elimination of a data leakage phenomenon that has silently undermined temporal modeling in prior work, preventing models from developing a genuine dependence on the audio signal.
+## ⚒️ Installation
+### Environment
+Ubuntu 20.04 or 22.04
+### Setup
+```bash
+git clone https://github.com/saeed5959/high_sync
+cd high_sync
+pip install -r requirements.txt
+sudo apt-get install -y ffmpeg
+```
+### Download Pretrained Weights
+```bash
+git lfs install
+git clone https://huggingface.co/saeed-5959/high_sync pretrained_weights
+```
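A quick preflight check can catch two common setup problems before inference: a tool missing from `PATH`, and weight files that are still Git LFS pointer stubs because `git lfs` was not active during the clone. The sketch below is not part of the repository; the helper names are hypothetical, and the default `pretrained_weights` directory simply mirrors the clone target used above.

```python
import shutil
from pathlib import Path

# First line of every Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def missing_tools(tools=("ffmpeg", "git-lfs")):
    """Return the names in `tools` that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


def is_lfs_pointer(path):
    """True if `path` is still a small LFS pointer stub, not the real file."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def unfetched_weights(weights_dir="pretrained_weights"):
    """List files under `weights_dir` that are still LFS pointer stubs."""
    root = Path(weights_dir)
    if not root.is_dir():
        return []
    return sorted(
        str(p) for p in root.rglob("*")
        if p.is_file() and ".git" not in p.parts and is_lfs_pointer(p)
    )


if __name__ == "__main__":
    print("missing tools:", missing_tools())
    print("unfetched LFS files:", unfetched_weights())
```

If the second list is non-empty, running `git lfs pull` inside `pretrained_weights` should replace the stubs with the actual files.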
+## 🚀 Usage
+First, convert your source video to 25 FPS:
+```bash
+ffmpeg -i input.mp4 -r 25 out_25.mp4
+```
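Re-encoding is only needed when the clip is not already at 25 FPS. As a minimal sketch, assuming `ffprobe` is installed alongside `ffmpeg`, the snippet below reads the first video stream's frame rate and builds the same `ffmpeg` call as above only when it differs. The function names are hypothetical, and `r_frame_rate` can be misleading for variable-frame-rate footage, so treat the result as a hint rather than a guarantee.

```python
import subprocess
from fractions import Fraction

TARGET_FPS = 25  # frame rate named in the usage step above


def parse_frame_rate(text):
    """Convert ffprobe's rational string, e.g. '30000/1001', to a Fraction."""
    return Fraction(text.strip())


def probe_frame_rate(video_path):
    """Ask ffprobe for the first video stream's frame rate."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=r_frame_rate",
         "-of", "default=noprint_wrappers=1:nokey=1", video_path],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_frame_rate(out.splitlines()[0])


def resample_command(src, dst, fps=TARGET_FPS):
    """Build the ffmpeg call from the step above as an argument list."""
    return ["ffmpeg", "-i", src, "-r", str(fps), dst]


def ensure_fps(src, dst, fps=TARGET_FPS):
    """Re-encode `src` to `dst` only if its frame rate differs from `fps`."""
    if probe_frame_rate(src) == fps:
        return src  # already at the target rate, reuse the original file
    subprocess.run(resample_command(src, dst, fps), check=True)
    return dst
```

Calling `ensure_fps("input.mp4", "out_25.mp4")` returns the path to pass on to inference.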
+Then run the inference script, passing the 25 FPS file (`out_25.mp4` above) as `--source_video`:
+```bash
+python -m inference --source_video "video_path.mp4" --driving_audio "audio_path.wav" --output "save_path.mp4"
+```
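For more than a handful of clips, the same command can be driven from a short script. The sketch below uses only the three flags shown above; the directory layout (one `.wav` per `.mp4`, matched by file name), the helper names, and the `repo_dir` default are assumptions for illustration, not something the repository prescribes.

```python
import subprocess
import sys
from pathlib import Path


def inference_command(source_video, driving_audio, output):
    """Build the inference call using only the three flags shown above."""
    return [
        sys.executable, "-m", "inference",
        "--source_video", str(source_video),
        "--driving_audio", str(driving_audio),
        "--output", str(output),
    ]


def pair_by_stem(video_dir, audio_dir, out_dir):
    """Yield (video, audio, output) for each .mp4 with a same-named .wav."""
    for video in sorted(Path(video_dir).glob("*.mp4")):
        audio = Path(audio_dir) / (video.stem + ".wav")
        if audio.is_file():
            yield video, audio, Path(out_dir) / video.name


def run_batch(video_dir, audio_dir, out_dir, repo_dir="high_sync"):
    """Run inference once per matched pair, from inside the cloned repo."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for video, audio, output in pair_by_stem(video_dir, audio_dir, out_dir):
        # Absolute paths keep the call valid after changing into repo_dir.
        cmd = inference_command(video.resolve(), audio.resolve(), output.resolve())
        subprocess.run(cmd, cwd=repo_dir, check=True)
```

Videos without a matching audio file are skipped silently; `check=True` stops the batch at the first failing run so errors are not buried.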
+## Citation
+```bibtex
+@article{daghigh2024highsync,
+  title={HighSync: High-Quality Lip Synchronization via Latent Diffusion Models},
+  author={Saeed Firouzi Daghigh and Majid Iranpour Mobarekeh and Mostafa Alavi and Mehdi Bagheri},
+  journal={arXiv preprint arXiv:2605.16918},
+  year={2026}
+}
+```
+## 🙏 Acknowledgements
+This work is mainly based on [EchoMimic](https://github.com/antgroup/echomimic). We would also like to thank the contributors to the [AnimateDiff](https://github.com/guoyww/AnimateDiff), [Moore-AnimateAnyone](https://github.com/MooreThreads/Moore-AnimateAnyone), and [MuseTalk](https://github.com/TMElyralab/MuseTalk) repositories.