Add fallback lip-sync algorithm using amplitude-driven mouth animation and update README accordingly
README.md
CHANGED
@@ -17,6 +17,7 @@ This project contains a simple web application that allows you to upload a singl
 - **Upload your own avatar:** any static image (PNG/JPG) or short video clip can be used as the source face.
 - **Upload an audio track:** accepts common audio formats (MP3/WAV/M4A) from 1–10 minutes long.
 - **Self‑contained setup:** on first use the application extracts the Wav2Lip source code from a zip archive (if present) and verifies that the model weights exist. If the environment allows outbound downloads, it will fetch the weights automatically; otherwise you can provide them manually. No local installation is required.
+- **Offline fallback:** if neither the repository nor the weights are available (for example on locked‑down networks where large downloads are forbidden), the app gracefully falls back to a lightweight amplitude‑based animation. It will still produce a talking head by stretching and squashing the mouth region in sync with the loudness of the audio. This effect is simpler than full Wav2Lip but ensures you always get a video out.
 - **Runs on free cloud hardware:** designed for deployment on [Hugging Face Spaces](https://huggingface.co/spaces), which provide free CPU/GPU resources for public projects.
 - **Extensible:** advanced users can tweak padding, segmentation and other options by modifying the inference arguments in ``app.py``.

@@ -27,8 +28,8 @@ When the **Generate video** button is pressed, the application performs the foll
 1. If the ``Wav2Lip`` folder is not present, it tries to extract it from a local zip archive named ``Wav2Lip-master.zip``. If the archive isn’t found it attempts a shallow clone from GitHub. On network‑restricted environments you should upload the archive yourself (see **Deploying to Spaces**).
 2. If the pre‑trained weights (``wav2lip_gan.pth`` and the face segmentation model) are not present, it attempts to download them from publicly available mirrors. These files are large (~436 MB and ~53 MB respectively). If the download fails, you can upload the files manually into the ``Wav2Lip/checkpoints`` folder.
 3. The uploaded image/video and audio files are saved into a temporary folder. Basic validation ensures that the audio duration is between 1 and 10 minutes; otherwise an error is shown.
-4. The application calls the official ``inference.py`` script from Wav2Lip in a subprocess. The script reads the avatar frame and audio, applies the lip‑sync model, and writes an MP4 video to the ``outputs`` directory.
-5. Once
+4. The application calls the official ``inference.py`` script from Wav2Lip in a subprocess. The script reads the avatar frame and audio, applies the lip‑sync model, and writes an MP4 video to the ``outputs`` directory. If this step fails because the repository or weights are missing, the app automatically switches to a basic fallback: it computes the loudness of the audio and stretches the mouth area of the avatar up and down to create a rudimentary talking animation.
+5. Once video generation completes (either via Wav2Lip or the fallback), the resulting MP4 is returned to the web UI and can be played or downloaded.

 The heavy lifting is done by Wav2Lip; this project simply wraps it in a clean user interface with sensible defaults and handles all setup.
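The duration rule from step 3 can be sketched outside the app. The app itself validates via pydub's ``AudioSegment``, which handles all supported formats; the helper below is an illustrative, standard-library-only variant that works for WAV files only (the function name and the 1–10 minute bounds mirror the README but are otherwise assumptions):

```python
import wave

def check_wav_duration(path: str, min_s: float = 60.0, max_s: float = 600.0) -> float:
    """Return the WAV file's duration in seconds; raise if outside [min_s, max_s]."""
    with wave.open(path, "rb") as wf:
        # frames / frame-rate gives the playback length in seconds
        duration = wf.getnframes() / wf.getframerate()
    if not (min_s <= duration <= max_s):
        raise ValueError(
            f"Audio must be between {min_s / 60:.0f} and {max_s / 60:.0f} minutes, got {duration:.1f}s"
        )
    return duration
```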
app.py
CHANGED
@@ -179,26 +179,148 @@ def validate_audio_length(audio_path: str) -> None:
 
 
 def run_inference(image_path: Path, audio_path: Path) -> Path:
-    """
-    …
+    """
+    Generate a lip‑synced video from an avatar image and audio track.
+
+    This function attempts to perform high‑quality lip synchronisation via the
+    Wav2Lip model. If the required model repository or weights are not
+    available (for example because outbound network traffic is blocked or the
+    weight files are too large to download), it falls back to a lightweight
+    amplitude‑driven mouth animation. The fallback uses only OpenCV and
+    MoviePy to create a simple talking head effect by stretching the mouth
+    region based on the loudness of the audio. Although not as accurate as
+    Wav2Lip, the fallback produces a plausible talking animation without
+    requiring any deep learning checkpoints.
+
+    Parameters
+    ----------
+    image_path : Path
+        Path to the avatar image saved on disk.
+    audio_path : Path
+        Path to the audio file saved on disk.
+
+    Returns
+    -------
+    Path
+        Path to the generated MP4 video relative to the working directory.
+    """
+    # First attempt the full Wav2Lip pipeline. If anything fails (e.g. missing
+    # repository or weights, runtime errors from the inference script), we
+    # swallow the error and fall back to the simple implementation.
+    try:
+        ensure_setup()
+        outputs_dir = Path("outputs")
+        outputs_dir.mkdir(exist_ok=True)
+        output_path = outputs_dir / f"result_{image_path.stem}.mp4"
+        cmd = [
+            "python", "inference.py",
+            "--checkpoint_path", str(CHECKPOINTS_DIR / WAV2LIP_MODEL),
+            "--segmentation_path", str(CHECKPOINTS_DIR / FACE_SEG_MODEL),
+            "--face", str(image_path),
+            "--audio", str(audio_path),
+            "--outfile", str(output_path),
+            "--pads", "0", "10", "0", "0",
+        ]
+        subprocess.run(cmd, cwd=str(REPO_DIR), check=True)
+        return output_path
+    except Exception:
+        # Fall back to simple lip‑sync implementation
+        return simple_lip_sync(image_path, audio_path)
+
+
+def simple_lip_sync(image_path: Path, audio_path: Path, fps: int = 25) -> Path:
+    """
+    Create a basic talking head animation without neural networks.
+
+    The fallback algorithm estimates speech activity from the audio's RMS
+    amplitude and animates the avatar by vertically scaling the mouth region
+    accordingly. The mouth is approximated as a box located in the lower
+    portion of the image. Each frame is generated by resizing this region
+    based on the normalised amplitude for that time slice. The resulting
+    frames are compiled into a video using MoviePy and the original audio is
+    attached.
+
+    Parameters
+    ----------
+    image_path : Path
+        Path to the input image.
+    audio_path : Path
+        Path to the input audio file.
+    fps : int, optional
+        Frames per second for the output video, by default 25.
+
+    Returns
+    -------
+    Path
+        Path to the generated video file.
+    """
+    import cv2  # imported here to avoid a mandatory dependency for users who provide Wav2Lip models
+    import moviepy.editor as mpy
+
+    # Load avatar image (BGR)
+    img = cv2.imread(str(image_path))
+    if img is None:
+        raise RuntimeError("Failed to load the avatar image. Please ensure the file is a valid image.")
+    height, width, _ = img.shape
+    # Approximate mouth bounding box (tune proportions if necessary)
+    mouth_w = int(width * 0.6)
+    mouth_h = int(height * 0.15)
+    mouth_x = int(width * 0.2)
+    mouth_y = int(height * 0.65)
+
+    # Load audio and compute amplitude per frame
+    audio = AudioSegment.from_file(str(audio_path))
+    samples = np.array(audio.get_array_of_samples()).astype(np.float32)
+    # Stereo to mono if necessary
+    if audio.channels > 1:
+        samples = samples.reshape((-1, audio.channels)).mean(axis=1)
+    frame_size = int(audio.frame_rate / fps)
+    n_frames = max(int(len(samples) / frame_size), 1)
+    amplitudes = []
+    for i in range(n_frames):
+        segment = samples[i * frame_size : (i + 1) * frame_size]
+        if segment.size == 0:
+            amp = 0.0
+        else:
+            # Root mean square of the audio segment
+            amp = float(np.sqrt(np.mean(segment ** 2)))
+        amplitudes.append(amp)
+    max_amp = max(amplitudes) if amplitudes else 1.0
+    if max_amp == 0:
+        max_amp = 1.0
+    amplitudes = [amp / max_amp for amp in amplitudes]
+
+    frames = []
+    for amp in amplitudes:
+        # Compute scaling factor between 1.0 (mouth closed) and 1.6 (fully open)
+        factor = 1.0 + amp * 0.6
+        frame_bgr = img.copy()
+        # Extract mouth ROI
+        roi = frame_bgr[mouth_y : mouth_y + mouth_h, mouth_x : mouth_x + mouth_w]
+        # Scale ROI vertically
+        new_h = max(1, int(mouth_h * factor))
+        scaled = cv2.resize(roi, (mouth_w, new_h), interpolation=cv2.INTER_LINEAR)
+        # Determine overlay region bounds (ensure we don't write outside image)
+        end_y = min(height, mouth_y + new_h)
+        overlay = scaled[: end_y - mouth_y, :, :]
+        frame_bgr[mouth_y:end_y, mouth_x : mouth_x + mouth_w] = overlay
+        # Convert to RGB for MoviePy
+        frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
+        frames.append(frame_rgb)
+
+    # Use MoviePy to assemble the video and attach audio
     outputs_dir = Path("outputs")
     outputs_dir.mkdir(exist_ok=True)
-    output_path = outputs_dir / f"…
-    # …
-        "--pads", "0", "10", "0", "0",  # default padding
-    ]
-    # Execute inside repository directory
-    subprocess.run(cmd, cwd=str(REPO_DIR), check=True)
+    output_path = outputs_dir / f"simple_{image_path.stem}.mp4"
+    clip = mpy.ImageSequenceClip(frames, fps=fps)
+    # Attach audio
+    audio_clip = mpy.AudioFileClip(str(audio_path))
+    # Trim audio to match video length if necessary
+    min_duration = min(clip.duration, audio_clip.duration)
+    clip = clip.set_audio(audio_clip.subclip(0, min_duration))
+    clip = clip.set_duration(min_duration)
+    # Write out using H.264 codec and AAC audio. Use preset ultrafast to reduce CPU usage.
+    clip.write_videofile(str(output_path), codec="libx264", audio_codec="aac", fps=fps, preset="ultrafast")
     return output_path
 
 
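The per‑frame loudness computation at the heart of ``simple_lip_sync`` can be exercised on its own: split the samples into one window per video frame, take each window's RMS, and normalise by the peak to get a 0–1 mouth‑openness value. A minimal sketch with NumPy (the function name and the synthetic sine input are illustrative, not part of the app):

```python
import numpy as np

def frame_amplitudes(samples: np.ndarray, sample_rate: int, fps: int = 25) -> list:
    """Per-video-frame loudness in [0, 1], computed as peak-normalised RMS."""
    frame_size = int(sample_rate / fps)            # audio samples per video frame
    n_frames = max(len(samples) // frame_size, 1)
    amps = []
    for i in range(n_frames):
        seg = samples[i * frame_size : (i + 1) * frame_size].astype(np.float64)
        amps.append(float(np.sqrt(np.mean(seg ** 2))) if seg.size else 0.0)
    peak = max(amps) or 1.0                        # avoid division by zero on silence
    return [a / peak for a in amps]

# One second of a 440 Hz tone at 16 kHz yields 25 frames of near-constant loudness.
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
amps = frame_amplitudes(tone, 16000, fps=25)
```

Silence maps to all-zero amplitudes, which is why ``simple_lip_sync`` keeps the mouth closed (scale factor 1.0) during quiet passages.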