Improve model card

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +60 -34
README.md CHANGED
@@ -1,34 +1,60 @@
- <h1 align='center'>HighSync: High-Quality Lip Synchronization via
- Latent Diffusion Models</h1>
-
- <div align='center'>
- <a href='https://github.com/saeed5959' target='_blank'>Saeed Firouzi</a><sup>1</sup>&emsp;
- </div>
-
- <br>
-
- <div align='center'>
- <a href='https://github.com/saeed5959/high_sync'><img src='https://img.shields.io/badge/github-8da0cb?style=for-the-badge&labelColor=555555&logo=github'></a>
- <a href='https://arxiv.org/abs/2605.16918'><img src='https://img.shields.io/badge/Paper-Arxiv-red'></a>
- <a href='https://huggingface.co/datasets/saeed-5959/vfhq'><img src='https://img.shields.io/badge/Dataset-Hugging_Face-CFAFD4'></a>
- </div>
-
-
- ## Abstraction
- We present HighSync, an end-to-end diffusion-based
- framework for high-fidelity lip synchronization that generates
- photorealistic talking-face videos aligned with arbitrary input
- audio. Existing approaches consistently struggle to reconcile
- image quality with synchronization accuracy, producing either
- visually degraded outputs or temporally inconsistent lip move-
- ments. HighSync addresses both challenges simultaneously and,
- to our knowledge, is the first lip sync model to operate natively
- at 512×512 resolution, positioning it as a viable solution for
- professional production environments such as the film and broad-
- cast industries. Central to our approach is the identification and
- systematic elimination of a data leakage phenomenon that has
- silently undermined temporal modeling in prior work, preventing
- models from developing a genuine dependence on the audio
- signal. Comprehensive evaluations across both perceptual quality
- and synchronization accuracy metrics confirm that HighSync
- achieves state-of-the-art performance on both fronts.
+ ---
+ pipeline_tag: image-to-video
+ library_name: diffusers
+ ---
+
+ <h1 align='center'>HighSync: High-Quality Lip Synchronization via Latent Diffusion Models</h1>
+
+ HighSync is an end-to-end diffusion-based framework for high-fidelity lip synchronization that generates photorealistic talking-face videos aligned with arbitrary input audio. To the authors' knowledge, it is the first lip sync model to operate natively at 512x512 resolution, which makes it a viable option for professional production environments such as the film and broadcast industries.
+
+ - **Paper:** [HighSync: High-Quality Lip Synchronization via Latent Diffusion Models](https://huggingface.co/papers/2605.16918)
+ - **GitHub:** [saeed5959/high_sync](https://github.com/saeed5959/high_sync)
+
+ ## Abstract
+ We present HighSync, an end-to-end diffusion-based framework for high-fidelity lip synchronization that generates photorealistic talking-face videos aligned with arbitrary input audio. Existing approaches consistently struggle to reconcile image quality with synchronization accuracy, producing either visually degraded outputs or temporally inconsistent lip movements. HighSync addresses both challenges simultaneously and, to our knowledge, is the first lip sync model to operate natively at 512x512 resolution, positioning it as a viable solution for professional production environments such as the film and broadcast industries. Central to our approach is the identification and systematic elimination of a data leakage phenomenon that has silently undermined temporal modeling in prior work, preventing models from developing a genuine dependence on the audio signal. Comprehensive evaluations across both perceptual quality and synchronization accuracy metrics confirm that HighSync achieves state-of-the-art performance on both fronts.
+
+ ## ⚒️ Installation
+
+ ### Environment
+ Ubuntu 20 or 22
+
+ ### Setup
+ ```bash
+ git clone https://github.com/saeed5959/high_sync
+ cd high_sync
+ pip install -r requirements.txt
+ sudo apt-get install ffmpeg
+ ```
+
+ ### Download Pretrained Weights
+ ```bash
+ git lfs install
+ git clone https://huggingface.co/saeed-5959/high_sync pretrained_weights
+ ```
+
+ ## 🚀 Usage
+
+ First, convert your source video to 25 FPS:
+
+ ```bash
+ ffmpeg -i input.mp4 -r 25 out_25.mp4
+ ```
+
+ Then run the inference script, passing the 25 FPS video as `--source_video`:
+
+ ```bash
+ python -m inference --source_video "video_path.mp4" --driving_audio "audio_path.wav" --output "save_path.mp4"
+ ```
+
+ ## Citation
+ ```bibtex
+ @article{daghigh2026highsync,
+ title={HighSync: High-Quality Lip Synchronization via Latent Diffusion Models},
+ author={Saeed Firouzi Daghigh and Majid Iranpour Mobarekeh and Mostafa Alavi and Mehdi Bagheri},
+ journal={arXiv preprint arXiv:2605.16918},
+ year={2026}
+ }
+ ```
+
+ ## 🙏 Acknowledgements
+ This work is mainly based on [EchoMimic](https://github.com/antgroup/echomimic). We would also like to thank the contributors to the [AnimateDiff](https://github.com/guoyww/AnimateDiff), [Moore-AnimateAnyone](https://github.com/MooreThreads/Moore-AnimateAnyone), and [MuseTalk](https://github.com/TMElyralab/MuseTalk) repositories.
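The Usage section requires a 25 FPS source video before inference. The sketch below automates that check. It assumes `ffprobe`, which ships with `ffmpeg`, is on the `PATH`. The helper names `is_25fps` and `ensure_25fps` are hypothetical and are not part of the high_sync repository. Skipping the re-encode when the input already reports 25 FPS is also an assumption, not documented behavior.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of the high_sync repository.

# Succeeds when an ffprobe rate string such as "25/1" or "50/2" is exactly 25 FPS.
is_25fps() {
  local rate="$1" num den
  case $rate in
    */*) num=${rate%%/*} den=${rate#*/} ;;
    *)   num=$rate den=1 ;;
  esac
  # Reject anything that is not two non-negative integers (e.g. "N/A").
  case $num in ''|*[!0-9]*) return 1 ;; esac
  case $den in ''|*[!0-9]*) return 1 ;; esac
  # num/den == 25 <=> num == 25 * den; a zero denominator ("0/0") never matches.
  [ "$den" -gt 0 ] && [ "$num" -eq $((25 * den)) ]
}

# Prints the path to pass as --source_video: the input itself when it already
# reports 25 FPS, otherwise a copy re-encoded to 25 FPS at the given output path.
ensure_25fps() {
  local src="$1" dst="$2" rate
  rate=$(ffprobe -v error -select_streams v:0 -show_entries stream=r_frame_rate \
    -of default=noprint_wrappers=1:nokey=1 "$src") || return 1
  if is_25fps "$rate"; then
    printf '%s\n' "$src"
  else
    # Same conversion as the README step, plus -y (overwrite) and quieter logs.
    ffmpeg -y -loglevel error -i "$src" -r 25 "$dst" && printf '%s\n' "$dst"
  fi
}
```

For example, run `src=$(ensure_25fps input.mp4 out_25.mp4)`, then `python -m inference --source_video "$src" --driving_audio "audio_path.wav" --output "save_path.mp4"`. The check reads `r_frame_rate`, which can differ from the average rate of a variable-frame-rate file. When in doubt, re-encoding unconditionally, as the README does, is the safer default.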