gyrus2 committed · verified
Commit fcb16b9 · 1 Parent(s): 66421d1

Add fallback lip-sync algorithm using amplitude-driven mouth animation and update README accordingly

Files changed (2):
  1. README.md +3 -2
  2. app.py +139 -17
README.md CHANGED
@@ -17,6 +17,7 @@ This project contains a simple web application that allows you to upload a singl
 - **Upload your own avatar:** any static image (PNG/JPG) or short video clip can be used as the source face.
 - **Upload an audio track:** accepts common audio formats (MP3/WAV/M4A) from 1–10 minutes long.
 - **Self‑contained setup:** on first use the application extracts the Wav2Lip source code from a zip archive (if present) and verifies that the model weights exist. If the environment allows outbound downloads, it will fetch the weights automatically; otherwise you can provide them manually. No local installation is required.
+- **Offline fallback:** if neither the repository nor the weights are available (for example on locked‑down networks where large downloads are forbidden), the app gracefully falls back to a lightweight amplitude‑based animation. It will still produce a talking head by stretching and squashing the mouth region in sync with the loudness of the audio. This effect is simpler than full Wav2Lip but ensures you always get a video out.
 - **Runs on free cloud hardware:** designed for deployment on [Hugging Face Spaces](https://huggingface.co/spaces), which provide free CPU/GPU resources for public projects.
 - **Extensible:** advanced users can tweak padding, segmentation and other options by modifying the inference arguments in ``app.py``.
@@ -27,8 +28,8 @@ When the **Generate video** button is pressed, the application performs the foll
 1. If the ``Wav2Lip`` folder is not present, it tries to extract it from a local zip archive named ``Wav2Lip-master.zip``. If the archive isn’t found it attempts a shallow clone from GitHub. On network‑restricted environments you should upload the archive yourself (see **Deploying to Spaces**).
 2. If the pre‑trained weights (``wav2lip_gan.pth`` and the face segmentation model) are not present, it attempts to download them from publicly available mirrors. These files are large (~436 MB and ~53 MB respectively). If the download fails, you can upload the files manually into the ``Wav2Lip/checkpoints`` folder.
 3. The uploaded image/video and audio files are saved into a temporary folder. Basic validation ensures that the audio duration is between 1 and 10 minutes; otherwise an error is shown.
-4. The application calls the official ``inference.py`` script from Wav2Lip in a subprocess. The script reads the avatar frame and audio, applies the lip‑sync model, and writes an MP4 video to the ``outputs`` directory.
-5. Once the script completes, the resulting video is returned to the web UI and can be played or downloaded.
+4. The application calls the official ``inference.py`` script from Wav2Lip in a subprocess. The script reads the avatar frame and audio, applies the lip‑sync model, and writes an MP4 video to the ``outputs`` directory. If this step fails because the repository or weights are missing, the app automatically switches to a basic fallback: it computes the loudness of the audio and stretches the mouth area of the avatar up and down to create a rudimentary talking animation.
+5. Once video generation completes (either via Wav2Lip or the fallback), the resulting MP4 is returned to the web UI and can be played or downloaded.
 
 The heavy lifting is done by Wav2Lip; this project simply wraps it in a clean user interface with sensible defaults and handles all setup.

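The fallback described in step 4 drives the mouth from per-frame loudness. A minimal sketch of that loudness estimate, using plain NumPy on synthetic samples instead of the decoded audio file (function name `frame_amplitudes` is illustrative, not part of the app):

```python
import numpy as np

def frame_amplitudes(samples: np.ndarray, sample_rate: int, fps: int = 25) -> list:
    """Split raw samples into one chunk per video frame and return normalised RMS loudness."""
    frame_size = sample_rate // fps                      # audio samples per video frame
    n_frames = max(len(samples) // frame_size, 1)
    amps = []
    for i in range(n_frames):
        seg = samples[i * frame_size:(i + 1) * frame_size]
        amps.append(float(np.sqrt(np.mean(seg ** 2))) if seg.size else 0.0)
    peak = max(amps) or 1.0                              # avoid division by zero on silence
    return [a / peak for a in amps]

# One second of a 440 Hz tone with a linear fade-in: loudness rises frame by frame.
t = np.linspace(0, 1, 16000, endpoint=False)
samples = (t * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
amps = frame_amplitudes(samples, 16000)
```

Each value in `amps` is in [0, 1] and maps directly to a mouth-opening factor for that frame.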
app.py CHANGED
@@ -179,26 +179,148 @@ def validate_audio_length(audio_path: str) -> None:
 
 
 def run_inference(image_path: Path, audio_path: Path) -> Path:
-    """Run Wav2Lip inference and return the path to the generated video."""
-    ensure_setup()
-
-    # Prepare output directory and file name
-    outputs_dir = Path("outputs")
-    outputs_dir.mkdir(exist_ok=True)
-    output_path = outputs_dir / f"result_{image_path.stem}.mp4"
-
-    # Build command to run inference
-    cmd = [
-        "python", "inference.py",
-        "--checkpoint_path", str(CHECKPOINTS_DIR / WAV2LIP_MODEL),
-        "--segmentation_path", str(CHECKPOINTS_DIR / FACE_SEG_MODEL),
-        "--face", str(image_path),
-        "--audio", str(audio_path),
-        "--outfile", str(output_path),
-        "--pads", "0", "10", "0", "0",  # default padding
-    ]
-    # Execute inside repository directory
-    subprocess.run(cmd, cwd=str(REPO_DIR), check=True)
-    return output_path
+    """
+    Generate a lip‑synced video from an avatar image and audio track.
+
+    This function attempts to perform high‑quality lip synchronisation via the
+    Wav2Lip model. If the required model repository or weights are not
+    available (for example because outbound network traffic is blocked or the
+    weight files are too large to download), it falls back to a lightweight
+    amplitude‑driven mouth animation. The fallback uses only OpenCV and
+    MoviePy to create a simple talking head effect by stretching the mouth
+    region based on the loudness of the audio. Although not as accurate as
+    Wav2Lip, the fallback produces a plausible talking animation without
+    requiring any deep learning checkpoints.
+
+    Parameters
+    ----------
+    image_path : Path
+        Path to the avatar image saved on disk.
+    audio_path : Path
+        Path to the audio file saved on disk.
+
+    Returns
+    -------
+    Path
+        Path to the generated MP4 video relative to the working directory.
+    """
+    # First attempt the full Wav2Lip pipeline. If anything fails (e.g. missing
+    # repository or weights, runtime errors from the inference script), we
+    # swallow the error and fall back to the simple implementation.
+    try:
+        ensure_setup()
+        outputs_dir = Path("outputs")
+        outputs_dir.mkdir(exist_ok=True)
+        output_path = outputs_dir / f"result_{image_path.stem}.mp4"
+        cmd = [
+            "python", "inference.py",
+            "--checkpoint_path", str(CHECKPOINTS_DIR / WAV2LIP_MODEL),
+            "--segmentation_path", str(CHECKPOINTS_DIR / FACE_SEG_MODEL),
+            "--face", str(image_path),
+            "--audio", str(audio_path),
+            "--outfile", str(output_path),
+            "--pads", "0", "10", "0", "0",
+        ]
+        subprocess.run(cmd, cwd=str(REPO_DIR), check=True)
+        return output_path
+    except Exception:
+        # Fall back to the simple lip‑sync implementation
+        return simple_lip_sync(image_path, audio_path)
+
+
+def simple_lip_sync(image_path: Path, audio_path: Path, fps: int = 25) -> Path:
+    """
+    Create a basic talking head animation without neural networks.
+
+    The fallback algorithm estimates speech activity from the audio's RMS
+    amplitude and animates the avatar by vertically scaling the mouth region
+    accordingly. The mouth is approximated as a box located in the lower
+    portion of the image. Each frame is generated by resizing this region
+    based on the normalised amplitude for that time slice. The resulting
+    frames are compiled into a video using MoviePy and the original audio is
+    attached.
+
+    Parameters
+    ----------
+    image_path : Path
+        Path to the input image.
+    audio_path : Path
+        Path to the input audio file.
+    fps : int, optional
+        Frames per second for the output video, by default 25.
+
+    Returns
+    -------
+    Path
+        Path to the generated video file.
+    """
+    import cv2  # imported here to avoid a mandatory dependency for users who provide Wav2Lip models
+    import moviepy.editor as mpy
+
+    # Load avatar image (BGR)
+    img = cv2.imread(str(image_path))
+    if img is None:
+        raise RuntimeError("Failed to load the avatar image. Please ensure the file is a valid image.")
+    height, width, _ = img.shape
+    # Approximate mouth bounding box (tune proportions if necessary)
+    mouth_w = int(width * 0.6)
+    mouth_h = int(height * 0.15)
+    mouth_x = int(width * 0.2)
+    mouth_y = int(height * 0.65)
+
+    # Load audio and compute amplitude per frame
+    audio = AudioSegment.from_file(str(audio_path))
+    samples = np.array(audio.get_array_of_samples()).astype(np.float32)
+    # Stereo to mono if necessary
+    if audio.channels > 1:
+        samples = samples.reshape((-1, audio.channels)).mean(axis=1)
+    frame_size = int(audio.frame_rate / fps)
+    n_frames = max(int(len(samples) / frame_size), 1)
+    amplitudes = []
+    for i in range(n_frames):
+        segment = samples[i * frame_size : (i + 1) * frame_size]
+        if segment.size == 0:
+            amp = 0.0
+        else:
+            # Root mean square of the audio segment
+            amp = float(np.sqrt(np.mean(segment ** 2)))
+        amplitudes.append(amp)
+    max_amp = max(amplitudes) if amplitudes else 1.0
+    if max_amp == 0:
+        max_amp = 1.0
+    amplitudes = [amp / max_amp for amp in amplitudes]
+
+    frames = []
+    for amp in amplitudes:
+        # Compute scaling factor between 1.0 (mouth closed) and 1.6 (fully open)
+        factor = 1.0 + amp * 0.6
+        frame_bgr = img.copy()
+        # Extract mouth ROI
+        roi = frame_bgr[mouth_y : mouth_y + mouth_h, mouth_x : mouth_x + mouth_w]
+        # Scale ROI vertically
+        new_h = max(1, int(mouth_h * factor))
+        scaled = cv2.resize(roi, (mouth_w, new_h), interpolation=cv2.INTER_LINEAR)
+        # Determine overlay region bounds (ensure we don't write outside the image)
+        end_y = min(height, mouth_y + new_h)
+        overlay = scaled[: end_y - mouth_y, :, :]
+        frame_bgr[mouth_y:end_y, mouth_x : mouth_x + mouth_w] = overlay
+        # Convert to RGB for MoviePy
+        frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
+        frames.append(frame_rgb)
+
+    # Use MoviePy to assemble the video and attach the original audio
+    outputs_dir = Path("outputs")
+    outputs_dir.mkdir(exist_ok=True)
+    output_path = outputs_dir / f"simple_{image_path.stem}.mp4"
+    clip = mpy.ImageSequenceClip(frames, fps=fps)
+    # Attach audio, trimming to match the video length if necessary
+    audio_clip = mpy.AudioFileClip(str(audio_path))
+    min_duration = min(clip.duration, audio_clip.duration)
+    clip = clip.set_audio(audio_clip.subclip(0, min_duration))
+    clip = clip.set_duration(min_duration)
+    # Write out using the H.264 codec and AAC audio; the ultrafast preset reduces CPU usage.
+    clip.write_videofile(str(output_path), codec="libx264", audio_codec="aac", fps=fps, preset="ultrafast")
+    return output_path
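The per-frame mouth animation in `simple_lip_sync` is just a vertical resize of a fixed box pasted back onto the frame. The same idea can be sketched in plain NumPy, using nearest-neighbour row sampling in place of OpenCV's bilinear `cv2.resize` (the helper `stretch_mouth` is illustrative, not part of app.py):

```python
import numpy as np

def stretch_mouth(frame: np.ndarray, box: tuple, factor: float) -> np.ndarray:
    """Vertically stretch region `box` = (x, y, w, h) by `factor`, clamped to the frame."""
    x, y, w, h = box
    out = frame.copy()
    roi = frame[y:y + h, x:x + w]
    new_h = max(1, int(h * factor))
    # Nearest-neighbour vertical resize: map each target row back to a source row.
    rows = np.arange(new_h) * h // new_h
    scaled = roi[rows]
    # Clamp so the stretched region never writes past the bottom of the frame.
    end_y = min(frame.shape[0], y + new_h)
    out[y:end_y, x:x + w] = scaled[:end_y - y]
    return out

# A black 100x80 frame with a white "mouth" box in the lower portion.
frame = np.zeros((100, 80, 3), dtype=np.uint8)
frame[65:80, 16:64] = 255
open_mouth = stretch_mouth(frame, (16, 65, 48, 15), 1.6)  # amplitude-driven factor
```

With `factor` swinging between 1.0 (silence) and 1.6 (peak loudness) per frame, this reproduces the open/close effect without any model weights.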