Add fallback lip-sync algorithm using amplitude-driven mouth animation and update README accordingly
README.md
CHANGED
@@ -17,6 +17,7 @@ This project contains a simple web application that allows you to upload a singl
 - **Upload your own avatar:** any static image (PNG/JPG) or short video clip can be used as the source face.
 - **Upload an audio track:** accepts common audio formats (MP3/WAV/M4A) from 1–10 minutes long.
 - **Self‑contained setup:** on first use the application extracts the Wav2Lip source code from a zip archive (if present) and verifies that the model weights exist. If the environment allows outbound downloads, it will fetch the weights automatically; otherwise you can provide them manually. No local installation is required.
+- **Offline fallback:** if neither the repository nor the weights are available (for example on locked‑down networks where large downloads are forbidden), the app gracefully falls back to a lightweight amplitude‑based animation. It will still produce a talking head by stretching and squashing the mouth region in sync with the loudness of the audio. This effect is simpler than full Wav2Lip but ensures you always get a video out.
 - **Runs on free cloud hardware:** designed for deployment on [Hugging Face Spaces](https://huggingface.co/spaces), which provide free CPU/GPU resources for public projects.
 - **Extensible:** advanced users can tweak padding, segmentation and other options by modifying the inference arguments in ``app.py``.

@@ -27,8 +28,8 @@ When the **Generate video** button is pressed, the application performs the foll
 1. If the ``Wav2Lip`` folder is not present, it tries to extract it from a local zip archive named ``Wav2Lip-master.zip``. If the archive isn’t found it attempts a shallow clone from GitHub. On network‑restricted environments you should upload the archive yourself (see **Deploying to Spaces**).
 2. If the pre‑trained weights (``wav2lip_gan.pth`` and the face segmentation model) are not present, it attempts to download them from publicly available mirrors. These files are large (~436 MB and ~53 MB respectively). If the download fails, you can upload the files manually into the ``Wav2Lip/checkpoints`` folder.
 3. The uploaded image/video and audio files are saved into a temporary folder. Basic validation ensures that the audio duration is between 1 and 10 minutes; otherwise an error is shown.
-4. The application calls the official ``inference.py`` script from Wav2Lip in a subprocess. The script reads the avatar frame and audio, applies the lip‑sync model, and writes an MP4 video to the ``outputs`` directory.
-5. Once
+4. The application calls the official ``inference.py`` script from Wav2Lip in a subprocess. The script reads the avatar frame and audio, applies the lip‑sync model, and writes an MP4 video to the ``outputs`` directory. If this step fails because the repository or weights are missing, the app automatically switches to a basic fallback: it computes the loudness of the audio and stretches the mouth area of the avatar up and down to create a rudimentary talking animation.
+5. Once video generation completes (either via Wav2Lip or the fallback), the resulting MP4 is returned to the web UI and can be played or downloaded.

 The heavy lifting is done by Wav2Lip; this project simply wraps it in a clean user interface with sensible defaults and handles all setup.
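The duration rule from step 3 can be sketched outside the app. The app itself validates via pydub's ``AudioSegment``, which handles all supported formats; the helper below is an illustrative, standard-library-only variant that works for WAV files only (the function name and the 1–10 minute bounds mirror the README but are otherwise assumptions):

```python
import wave

def check_wav_duration(path: str, min_s: float = 60.0, max_s: float = 600.0) -> float:
    """Return the WAV file's duration in seconds; raise if outside [min_s, max_s]."""
    with wave.open(path, "rb") as wf:
        # frames / frame-rate gives the playback length in seconds
        duration = wf.getnframes() / wf.getframerate()
    if not (min_s <= duration <= max_s):
        raise ValueError(
            f"Audio must be between {min_s / 60:.0f} and {max_s / 60:.0f} minutes, got {duration:.1f}s"
        )
    return duration
```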
app.py
CHANGED
@@ -179,26 +179,148 @@ def validate_audio_length(audio_path: str) -> None:
 
 
 def run_inference(image_path: Path, audio_path: Path) -> Path:
-    """
-    …
+    """
+    Generate a lip‑synced video from an avatar image and audio track.
+
+    This function attempts to perform high‑quality lip synchronisation via the
+    Wav2Lip model. If the required model repository or weights are not
+    available (for example because outbound network traffic is blocked or the
+    weight files are too large to download), it falls back to a lightweight
+    amplitude‑driven mouth animation. The fallback uses only OpenCV and
+    MoviePy to create a simple talking head effect by stretching the mouth
+    region based on the loudness of the audio. Although not as accurate as
+    Wav2Lip, the fallback produces a plausible talking animation without
+    requiring any deep learning checkpoints.
+
+    Parameters
+    ----------
+    image_path : Path
+        Path to the avatar image saved on disk.
+    audio_path : Path
+        Path to the audio file saved on disk.
+
+    Returns
+    -------
+    Path
+        Path to the generated MP4 video relative to the working directory.
+    """
+    # First attempt the full Wav2Lip pipeline. If anything fails (e.g. missing
+    # repository or weights, runtime errors from the inference script), we
+    # swallow the error and fall back to the simple implementation.
+    try:
+        ensure_setup()
+        outputs_dir = Path("outputs")
+        outputs_dir.mkdir(exist_ok=True)
+        output_path = outputs_dir / f"result_{image_path.stem}.mp4"
+        cmd = [
+            "python", "inference.py",
+            "--checkpoint_path", str(CHECKPOINTS_DIR / WAV2LIP_MODEL),
+            "--segmentation_path", str(CHECKPOINTS_DIR / FACE_SEG_MODEL),
+            "--face", str(image_path),
+            "--audio", str(audio_path),
+            "--outfile", str(output_path),
+            "--pads", "0", "10", "0", "0",
+        ]
+        subprocess.run(cmd, cwd=str(REPO_DIR), check=True)
+        return output_path
+    except Exception:
+        # Fall back to simple lip‑sync implementation
+        return simple_lip_sync(image_path, audio_path)
+
+
+def simple_lip_sync(image_path: Path, audio_path: Path, fps: int = 25) -> Path:
+    """
+    Create a basic talking head animation without neural networks.
+
+    The fallback algorithm estimates speech activity from the audio's RMS
+    amplitude and animates the avatar by vertically scaling the mouth region
+    accordingly. The mouth is approximated as a box located in the lower
+    portion of the image. Each frame is generated by resizing this region
+    based on the normalised amplitude for that time slice. The resulting
+    frames are compiled into a video using MoviePy and the original audio is
+    attached.
+
+    Parameters
+    ----------
+    image_path : Path
+        Path to the input image.
+    audio_path : Path
+        Path to the input audio file.
+    fps : int, optional
+        Frames per second for the output video, by default 25.
+
+    Returns
+    -------
+    Path
+        Path to the generated video file.
+    """
+    import cv2  # imported here to avoid a mandatory dependency for users who provide Wav2Lip models
+    import moviepy.editor as mpy
+
+    # Load avatar image (BGR)
+    img = cv2.imread(str(image_path))
+    if img is None:
+        raise RuntimeError("Failed to load the avatar image. Please ensure the file is a valid image.")
+    height, width, _ = img.shape
+    # Approximate mouth bounding box (tune proportions if necessary)
+    mouth_w = int(width * 0.6)
+    mouth_h = int(height * 0.15)
+    mouth_x = int(width * 0.2)
+    mouth_y = int(height * 0.65)
+
+    # Load audio and compute amplitude per frame
+    audio = AudioSegment.from_file(str(audio_path))
+    samples = np.array(audio.get_array_of_samples()).astype(np.float32)
+    # Stereo to mono if necessary
+    if audio.channels > 1:
+        samples = samples.reshape((-1, audio.channels)).mean(axis=1)
+    frame_size = int(audio.frame_rate / fps)
+    n_frames = max(int(len(samples) / frame_size), 1)
+    amplitudes = []
+    for i in range(n_frames):
+        segment = samples[i * frame_size : (i + 1) * frame_size]
+        if segment.size == 0:
+            amp = 0.0
+        else:
+            # Root mean square of the audio segment
+            amp = float(np.sqrt(np.mean(segment ** 2)))
+        amplitudes.append(amp)
+    max_amp = max(amplitudes) if amplitudes else 1.0
+    if max_amp == 0:
+        max_amp = 1.0
+    amplitudes = [amp / max_amp for amp in amplitudes]
+
+    frames = []
+    for amp in amplitudes:
+        # Compute scaling factor between 1.0 (mouth closed) and 1.6 (fully open)
+        factor = 1.0 + amp * 0.6
+        frame_bgr = img.copy()
+        # Extract mouth ROI
+        roi = frame_bgr[mouth_y : mouth_y + mouth_h, mouth_x : mouth_x + mouth_w]
+        # Scale ROI vertically
+        new_h = max(1, int(mouth_h * factor))
+        scaled = cv2.resize(roi, (mouth_w, new_h), interpolation=cv2.INTER_LINEAR)
+        # Determine overlay region bounds (ensure we don't write outside image)
+        end_y = min(height, mouth_y + new_h)
+        overlay = scaled[: end_y - mouth_y, :, :]
+        frame_bgr[mouth_y:end_y, mouth_x : mouth_x + mouth_w] = overlay
+        # Convert to RGB for MoviePy
+        frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
+        frames.append(frame_rgb)
+
+    # Use MoviePy to assemble the video and attach audio
     outputs_dir = Path("outputs")
     outputs_dir.mkdir(exist_ok=True)
-    output_path = outputs_dir / f"…
-    # …
-        "--pads", "0", "10", "0", "0",  # default padding
-    ]
-    # Execute inside repository directory
-    subprocess.run(cmd, cwd=str(REPO_DIR), check=True)
+    output_path = outputs_dir / f"simple_{image_path.stem}.mp4"
+    clip = mpy.ImageSequenceClip(frames, fps=fps)
+    # Attach audio
+    audio_clip = mpy.AudioFileClip(str(audio_path))
+    # Trim audio to match video length if necessary
+    min_duration = min(clip.duration, audio_clip.duration)
+    clip = clip.set_audio(audio_clip.subclip(0, min_duration))
+    clip = clip.set_duration(min_duration)
+    # Write out using H.264 codec and AAC audio. Use preset ultrafast to reduce CPU usage.
+    clip.write_videofile(str(output_path), codec="libx264", audio_codec="aac", fps=fps, preset="ultrafast")
     return output_path
 
 
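The per‑frame loudness computation at the heart of ``simple_lip_sync`` can be exercised on its own: split the samples into one window per video frame, take each window's RMS, and normalise by the peak to get a 0–1 mouth‑openness value. A minimal sketch with NumPy (the function name and the synthetic sine input are illustrative, not part of the app):

```python
import numpy as np

def frame_amplitudes(samples: np.ndarray, sample_rate: int, fps: int = 25) -> list:
    """Per-video-frame loudness in [0, 1], computed as peak-normalised RMS."""
    frame_size = int(sample_rate / fps)            # audio samples per video frame
    n_frames = max(len(samples) // frame_size, 1)
    amps = []
    for i in range(n_frames):
        seg = samples[i * frame_size : (i + 1) * frame_size].astype(np.float64)
        amps.append(float(np.sqrt(np.mean(seg ** 2))) if seg.size else 0.0)
    peak = max(amps) or 1.0                        # avoid division by zero on silence
    return [a / peak for a in amps]

# One second of a 440 Hz tone at 16 kHz yields 25 frames of near-constant loudness.
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
amps = frame_amplitudes(tone, 16000, fps=25)
```

Silence maps to all-zero amplitudes, which is why ``simple_lip_sync`` keeps the mouth closed (scale factor 1.0) during quiet passages.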