Add lip-sync app code and configuration
- README.md +54 -14
- app.py +183 -0
- requirements.txt +15 -0
README.md
CHANGED
@@ -1,14 +1,54 @@
# Lip-Sync Video Generator

This project contains a simple web application that lets you upload a single frame (or short clip) of an avatar together with an audio file and produce a lip‑synchronised video. Internally the app uses the open‑source **Wav2Lip** model to animate the avatar so that the mouth movements follow the uploaded audio. Everything runs on a free cloud platform, so there is no need to install anything locally.

## Features

- ✅ **Upload your own avatar:** any static image (PNG, JPG) or existing video clip can be used as the source face.
- ✅ **Upload an audio track:** accepts common audio formats (MP3, WAV, M4A) between 1 and 10 minutes long.
- ✅ **Fully automatic:** the application clones the Wav2Lip repository and downloads the necessary model weights on first run, so no manual setup is required.
- ✅ **Runs on free cloud hardware:** designed to be deployed to [Hugging Face Spaces](https://huggingface.co/spaces), which provide free CPU/GPU resources for open‑source projects. You only need a free Hugging Face account to host the app.
- ✅ **Extensible:** advanced users can tweak padding, super‑resolution and segmentation options by modifying the arguments passed to the inference script (see the sketch after this list).
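For instance, here is a minimal sketch of how the argument list built in ``run_inference`` (see ``app.py`` below) could be extended. ``--pads`` is already used by the app; ``--resize_factor`` and ``--nosmooth`` are standard Wav2Lip flags, but the exact set depends on the fork you clone, so verify them against ``inference.py --help``:

```python
# Hypothetical variant of the argument list assembled in run_inference (app.py).
cmd = [
    "python", "inference.py",
    "--checkpoint_path", "checkpoints/wav2lip_gan.pth",
    "--face", "avatar.png",
    "--audio", "speech.wav",
    "--outfile", "outputs/result.mp4",
    "--pads", "0", "20", "0", "0",   # extra padding below the chin
    "--resize_factor", "2",          # downscale the input to speed up inference
    "--nosmooth",                    # disable temporal smoothing of face detections
]
```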

## How it works

When the “Generate video” button is pressed, the application performs the following steps:

1. If the Wav2Lip repository is not already present, it clones it from GitHub. The clone is shallow (only the latest commit) to minimise the download size.
2. If the pre‑trained weights (``wav2lip_gan.pth`` and the face segmentation model) are not present, it downloads them from publicly available sources. The downloads are large (~436 MB), so the first run can take some time.
3. The uploaded image/video and audio files are saved into a temporary folder. Basic validation ensures that the audio duration is between 1 and 10 minutes; otherwise an error is shown.
4. The application calls the official ``inference.py`` script from Wav2Lip in a subprocess. The script reads the avatar frame and audio, applies the lip‑sync model, and writes an MP4 video to the ``outputs`` directory.
5. Once the script completes, the resulting video is returned to the web UI and can be played or downloaded.

The heavy lifting is done by Wav2Lip; this project simply wraps it in a clean user interface with sensible defaults and handles all setup.

## Deploying to Hugging Face Spaces

1. **Create a free account:** go to [huggingface.co](https://huggingface.co) and create a free account if you don’t already have one.
2. **Create a new Space:** from your dashboard, click “New Space”, choose the **Gradio** SDK and give your Space a name (e.g. ``lip-sync-app``). Set it to “Public” so that it can use free hardware.
3. **Upload the project files:** clone or download this repository, then upload the contents of the ``lipsync_app`` folder (``app.py``, ``requirements.txt`` and this ``README.md``) into your new Space. The file structure should look like this:

   ```
   ├── app.py
   ├── requirements.txt
   └── README.md
   ```

4. **Commit and build:** once the files are uploaded, commit them to the Space. Hugging Face installs the dependencies and builds the application automatically; the first build may take a few minutes. The Wav2Lip repository and model weights are downloaded the first time a video is generated, so the first generation is also slower.
5. **Use the app:** after the build succeeds, open the “App” tab of your Space. You can now upload an image and an audio file and click “Generate video” to produce a lip‑synchronised output. Longer audio clips (up to ten minutes) will take longer to process.

## Limitations

- **Large downloads on first run:** the Wav2Lip weights are hundreds of megabytes. The application caches them in the working directory so subsequent runs are faster. If the environment is reset (e.g. if the Space times out), the weights will be downloaded again.
- **Processing time:** running Wav2Lip on CPU is slow (several minutes for a 1‑minute clip). For best performance, enable GPU hardware in your Hugging Face Space settings; a quick check for GPU visibility is sketched after this list. GPU hours are free on public Spaces but limited, so plan accordingly.
- **Avatar quality:** Wav2Lip works best with clear, front‑facing images where the mouth is visible. Complex backgrounds or occlusions can degrade the result.

## Troubleshooting

- *Error cloning repository:* if cloning from GitHub fails due to network restrictions, you can manually upload the Wav2Lip repository into the Space. Download it locally from <https://github.com/Rudrabha/Wav2Lip> and add it as a folder named ``Wav2Lip`` next to ``app.py``.
- *Model download fails:* check the logs of your Space for errors. You can upload the ``wav2lip_gan.pth`` and ``face_segmentation.pth`` files manually into the ``checkpoints`` folder inside the ``Wav2Lip`` directory to avoid the automatic downloads; a sketch of this is shown after this list.
- *App times out or runs out of memory:* try reducing the audio length or using CPU hardware. Free GPU instances provide limited memory (e.g. a T4), which may not handle extremely high‑resolution inputs.
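A minimal sketch of pre‑placing the checkpoints, assuming the mirror repository referenced in ``app.py`` (``Non-playing-Character/Wave2lip``) is still available and that ``huggingface_hub`` is installed:

```python
from pathlib import Path
import shutil

from huggingface_hub import hf_hub_download

# Target folder matches CHECKPOINTS_DIR in app.py (Wav2Lip/checkpoints).
checkpoints = Path("Wav2Lip/checkpoints")
checkpoints.mkdir(parents=True, exist_ok=True)

for filename in ("wav2lip_gan.pth", "face_segmentation.pth"):
    # Downloads into the local Hugging Face cache, then copies into the checkpoints folder.
    cached = hf_hub_download(repo_id="Non-playing-Character/Wave2lip", filename=filename)
    shutil.copy(cached, checkpoints / filename)
```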

## Acknowledgements

This project would not be possible without the original [Wav2Lip](https://github.com/Rudrabha/Wav2Lip) authors and their openly released code and model weights (see the upstream repository for its licence terms). It also makes use of [Gradio](https://gradio.app/) for the web interface.
app.py
ADDED
@@ -0,0 +1,183 @@
"""
|
| 2 |
+
Gradio web application for generating lip‑synchronised videos from an avatar image
|
| 3 |
+
and an audio clip. On the first run it clones the Wav2Lip repository and
|
| 4 |
+
downloads the required model checkpoints. Subsequent runs reuse the cached
|
| 5 |
+
repository and weights. The heavy inference is executed by spawning the
|
| 6 |
+
original `inference.py` script provided by Wav2Lip.
|
| 7 |
+
|
| 8 |
+
To deploy on Hugging Face Spaces:
|
| 9 |
+
1. Create a new Gradio space.
|
| 10 |
+
2. Upload this file along with ``requirements.txt`` and ``README.md``.
|
| 11 |
+
3. Commit and wait for the space to build. Enable GPU hardware for best performance.
|
| 12 |
+
|
| 13 |
+
Author: ChatGPT (2025)
|
| 14 |
+
License: MIT (same as Wav2Lip)
|
| 15 |
+
"""
|
| 16 |
+
|
| 17 |
+
import os
import shutil
import subprocess
import tempfile
from pathlib import Path

import gradio as gr
import numpy as np
from pydub import AudioSegment


# Constants for repository and checkpoint names
REPO_URL = "https://github.com/Rudrabha/Wav2Lip.git"
REPO_DIR = Path("Wav2Lip")
CHECKPOINTS_DIR = REPO_DIR / "checkpoints"
WAV2LIP_MODEL = "wav2lip_gan.pth"
FACE_SEG_MODEL = "face_segmentation.pth"

# Direct download links for model weights. These links point to publicly
# accessible mirrors. If they stop working, you can upload the files into
# the checkpoints folder manually and the app will skip downloading.
MODEL_URLS = {
    WAV2LIP_MODEL: "https://huggingface.co/Non-playing-Character/Wave2lip/resolve/main/wav2lip_gan.pth",
    FACE_SEG_MODEL: "https://huggingface.co/Non-playing-Character/Wave2lip/resolve/main/face_segmentation.pth",
}


def clone_repository() -> None:
    """Clone the Wav2Lip GitHub repository if it does not already exist."""
    if REPO_DIR.exists():
        return
    # Use a shallow clone to reduce download size
    subprocess.run([
        "git",
        "clone",
        "--depth",
        "1",
        REPO_URL,
        str(REPO_DIR),
    ], check=True)


def download_model_weights() -> None:
    """Download the Wav2Lip and face segmentation checkpoints if missing."""
    CHECKPOINTS_DIR.mkdir(parents=True, exist_ok=True)
    for filename, url in MODEL_URLS.items():
        dest = CHECKPOINTS_DIR / filename
        if dest.exists():
            continue
        # Stream download using requests to avoid loading the full file into memory
        import requests
        with requests.get(url, stream=True) as r:
            r.raise_for_status()
            with open(dest, "wb") as f:
                for chunk in r.iter_content(chunk_size=8192):
                    if chunk:
                        f.write(chunk)


def ensure_setup() -> None:
    """Ensure that the repository and model weights are available."""
    clone_repository()
    download_model_weights()


def validate_audio_length(audio_path: str) -> None:
    """Raise a ValueError if the audio duration is not between 1 and 10 minutes."""
    audio = AudioSegment.from_file(audio_path)
    duration_sec = len(audio) / 1000.0
    if duration_sec < 60:
        raise ValueError("Audio is too short: please provide at least 1 minute of audio.")
    if duration_sec > 600:
        raise ValueError("Audio is too long: please limit the audio to 10 minutes.")
def run_inference(image_path: Path, audio_path: Path) -> Path:
    """Run Wav2Lip inference and return the path to the generated video."""
    ensure_setup()

    # Prepare the output directory and file name. Resolve to an absolute path
    # because the inference subprocess runs with its working directory set to
    # the Wav2Lip repository.
    outputs_dir = Path("outputs").resolve()
    outputs_dir.mkdir(exist_ok=True)
    output_path = outputs_dir / f"result_{image_path.stem}.mp4"

    # Build the command to run inference. All paths are passed absolute so they
    # remain valid inside the repository working directory.
    # Note: --segmentation_path is used by Wav2Lip-HQ style forks; the stock
    # Rudrabha/Wav2Lip inference.py may not accept it, so drop the flag if the
    # script rejects it.
    cmd = [
        "python", "inference.py",
        "--checkpoint_path", str((CHECKPOINTS_DIR / WAV2LIP_MODEL).resolve()),
        "--segmentation_path", str((CHECKPOINTS_DIR / FACE_SEG_MODEL).resolve()),
        "--face", str(image_path.resolve()),
        "--audio", str(audio_path.resolve()),
        "--outfile", str(output_path),
        "--pads", "0", "10", "0", "0",  # default padding
    ]
    # Execute inside the repository directory
    subprocess.run(cmd, cwd=str(REPO_DIR), check=True)
    return output_path
def generate_video(avatar_file, audio_file):
    """
    Gradio callback to generate a lip‑synced video.

    Parameters
    ----------
    avatar_file : str
        Path to the uploaded image or video containing the face (``gr.File``
        passes a file path with its default ``type="filepath"``).
    audio_file : str
        Path to the uploaded audio file.

    Returns
    -------
    str
        Path to the generated MP4 file.
    """
    if avatar_file is None or audio_file is None:
        return None

    # Copy the uploads into a temporary directory, keeping their original
    # extensions so Wav2Lip can tell a static image apart from a video clip.
    with tempfile.TemporaryDirectory() as tmpdir:
        avatar_path = Path(tmpdir) / f"avatar{Path(avatar_file).suffix}"
        audio_path = Path(tmpdir) / f"audio{Path(audio_file).suffix}"
        shutil.copy(avatar_file, avatar_path)
        shutil.copy(audio_file, audio_path)
        # Validate audio length
        try:
            validate_audio_length(str(audio_path))
        except Exception as e:
            raise gr.Error(str(e))
        # Run inference
        try:
            result_path = run_inference(avatar_path, audio_path)
        except subprocess.CalledProcessError as e:
            raise gr.Error(f"Inference failed: {e}")
        return str(result_path)
def build_interface():
    """Construct the Gradio interface."""
    with gr.Blocks(title="Lip‑Sync Video Generator") as demo:
        gr.Markdown(
            """
            # Lip‑Sync Video Generator

            Upload an image (PNG/JPG) or short video of an avatar and an audio file (MP3/WAV/M4A) between 1 and 10 minutes long. Click **Generate video** to create a new video where the avatar lip‑syncs to the audio. The first generation may take several minutes because the model needs to be downloaded.
            """
        )
        with gr.Row():
            avatar_input = gr.File(label="Avatar image/video", file_count="single")
            audio_input = gr.File(label="Audio (1–10 min)", file_count="single")
        generate_btn = gr.Button("Generate video")
        result_video = gr.Video(label="Output video", interactive=False)

        generate_btn.click(
            fn=generate_video,
            inputs=[avatar_input, audio_input],
            outputs=result_video,
            show_progress=True,
        )
    return demo


if __name__ == "__main__":
    demo = build_interface()
    # Queue requests so multiple users can use the model concurrently
    demo.queue().launch()
requirements.txt
ADDED
@@ -0,0 +1,15 @@
gradio==4.43.0
numpy
opencv-python
scipy
pydub
moviepy
torch
torchvision
gitpython
ffmpeg-python
tqdm
requests
gfpgan
realesrgan
basicsr