---
title: Lip Sync Generator
emoji: 🎵
colorFrom: indigo
colorTo: blue
sdk: gradio
app_file: app.py
pinned: false
---

# Lip-Sync Video Generator

This project contains a simple web application that lets you upload an avatar (a single frame or a short clip) together with an audio file and produce a lip‑synchronised video. Internally the app uses the open‑source Wav2Lip model to animate the avatar so that its mouth movements follow the uploaded audio. Everything runs on a free cloud platform; there is no need to install anything locally.
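
For orientation, here is a minimal sketch of how such a Gradio interface can be wired up. The `generate_video` stub and the exact component choices are illustrative assumptions; the real wiring lives in `app.py`.

```python
import gradio as gr

def generate_video(face_path: str, audio_path: str) -> str:
    """Placeholder for the real pipeline in app.py: run Wav2Lip
    (or the amplitude fallback) and return the output MP4 path."""
    ...

# Illustrative UI wiring -- not copied from the actual source.
demo = gr.Interface(
    fn=generate_video,
    inputs=[
        gr.Image(type="filepath", label="Avatar image"),
        gr.Audio(type="filepath", label="Audio track (1-10 minutes)"),
    ],
    outputs=gr.Video(label="Lip-synced result"),
    title="Lip-Sync Video Generator",
)

if __name__ == "__main__":
    demo.launch()
```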

## Features

- **Upload your own avatar:** any static image (PNG/JPG) or short video clip can be used as the source face.
- **Upload an audio track:** accepts common audio formats (MP3/WAV/M4A) between 1 and 10 minutes long.
- **Self‑contained setup:** on first use the application extracts the Wav2Lip source code from a zip archive (if present) and verifies that the model weights exist. If the environment allows outbound downloads, it fetches the weights automatically; otherwise you can provide them manually. No local installation is required.
- **Offline fallback:** if neither the repository nor the weights are available (for example on locked‑down networks where large downloads are forbidden), the app gracefully falls back to a lightweight amplitude‑based animation: it still produces a talking head by stretching and squashing the mouth region in sync with the loudness of the audio. The effect is simpler than full Wav2Lip but guarantees you always get a video out (a sketch of the idea follows this list).
- **Runs on free cloud hardware:** designed for deployment on Hugging Face Spaces, which provides free CPU/GPU resources for public projects.
- **Extensible:** advanced users can tweak padding, segmentation and other options by modifying the inference arguments in `app.py`.
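
The fallback can be illustrated with a short sketch. Everything here is an assumption for demonstration purposes (the mouth-box coordinates, the 25 fps rate, the scaling factor); the actual implementation in `app.py` may differ.

```python
import cv2
import librosa

FPS = 25

def amplitude_fallback(image_path: str, audio_path: str, out_path: str) -> None:
    """Crude talking-head effect: resize a fixed mouth box in proportion to
    audio loudness. The box position and scaling factor are illustrative
    assumptions, not the values used in app.py."""
    frame = cv2.imread(image_path)
    h, w = frame.shape[:2]
    # Assume the mouth sits in the lower-middle portion of the image.
    y0, y1 = int(h * 0.65), int(h * 0.85)
    x0, x1 = int(w * 0.35), int(w * 0.65)

    audio, sr = librosa.load(audio_path, sr=None, mono=True)
    hop = int(sr / FPS)  # one loudness value per video frame
    rms = librosa.feature.rms(y=audio, frame_length=2 * hop, hop_length=hop)[0]
    rms = rms / (rms.max() + 1e-8)  # normalise loudness to [0, 1]

    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), FPS, (w, h))
    for level in rms:
        out = frame.copy()
        # "Open" the mouth by stretching the box downwards as loudness rises.
        new_h = min(h - y0, int((y1 - y0) * (1.0 + 0.5 * float(level))))
        stretched = cv2.resize(frame[y0:y1, x0:x1], (x1 - x0, new_h))
        out[y0:y0 + new_h, x0:x1] = stretched
        writer.write(out)
    writer.release()
    # The real app would then mux the original audio back in with ffmpeg.
```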

## How it works

When the **Generate video** button is pressed, the application performs the following steps:

1. If the `Wav2Lip` folder is not present, the app tries to extract it from a local zip archive named `Wav2Lip-master.zip`. If the archive isn't found, it attempts a shallow clone from GitHub. On network‑restricted environments you should upload the archive yourself (see *Deploying to Hugging Face Spaces*). A sketch of this setup logic follows the list.
2. If the pre‑trained weights (`wav2lip_gan.pth` and the face segmentation model) are not present, it attempts to download them from publicly available mirrors. These files are large (~436 MB and ~53 MB respectively). If the download fails, you can upload the files manually into the `Wav2Lip/checkpoints` folder.
3. The uploaded image/video and audio files are saved into a temporary folder. Basic validation ensures that the audio duration is between 1 and 10 minutes; otherwise an error is shown.
4. The application calls the official `inference.py` script from Wav2Lip in a subprocess (an example invocation is shown below). The script reads the avatar frame and audio, applies the lip‑sync model, and writes an MP4 video to the outputs directory. If this step fails because the repository or weights are missing, the app automatically switches to the amplitude‑based fallback described above: it computes the loudness of the audio and stretches the mouth area of the avatar up and down to create a rudimentary talking animation.
5. Once video generation completes (via either Wav2Lip or the fallback), the resulting MP4 is returned to the web UI, where it can be played or downloaded.
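
A minimal sketch of steps 1 and 2, assuming `requests` is available and using placeholder mirror URLs (the real ones are defined in `MODEL_URLS` inside `app.py`):

```python
import subprocess
import zipfile
from pathlib import Path

import requests

REPO_DIR = Path("Wav2Lip")
ARCHIVE = Path("Wav2Lip-master.zip")

# Placeholder URLs -- the real mirrors live in MODEL_URLS inside app.py.
MODEL_URLS = {
    "checkpoints/wav2lip_gan.pth": "https://example.com/wav2lip_gan.pth",
    "checkpoints/face_segmentation.pth": "https://example.com/face_segmentation.pth",
}

def ensure_wav2lip() -> None:
    """Prefer the bundled zip, fall back to a shallow clone, then fetch
    any missing weights (steps 1 and 2 above)."""
    if not REPO_DIR.exists():
        if ARCHIVE.exists():
            with zipfile.ZipFile(ARCHIVE) as zf:
                zf.extractall(".")
            # GitHub zips extract with a "-master" suffix.
            Path("Wav2Lip-master").rename(REPO_DIR)
        else:
            subprocess.run(
                ["git", "clone", "--depth", "1",
                 "https://github.com/Rudrabha/Wav2Lip", str(REPO_DIR)],
                check=True,
            )
    for rel_path, url in MODEL_URLS.items():
        dest = REPO_DIR / rel_path
        if dest.exists():
            continue
        dest.parent.mkdir(parents=True, exist_ok=True)
        with requests.get(url, stream=True, timeout=60) as resp:
            resp.raise_for_status()
            with open(dest, "wb") as f:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    f.write(chunk)
```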

The heavy lifting is done by Wav2Lip; this project simply wraps it in a clean user interface with sensible defaults and handles all setup.
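
The call into the official script (step 4) boils down to a subprocess invocation along these lines. The flags shown are the standard ones from the upstream Wav2Lip repository; `app.py` may pass additional padding or segmentation options.

```python
import subprocess

def run_wav2lip(face_path: str, audio_path: str, out_path: str) -> None:
    """Invoke Wav2Lip's official inference script in a subprocess.
    A non-zero exit raises CalledProcessError, which the app can catch
    to trigger the amplitude-based fallback."""
    subprocess.run(
        [
            "python", "Wav2Lip/inference.py",
            "--checkpoint_path", "Wav2Lip/checkpoints/wav2lip_gan.pth",
            "--face", face_path,
            "--audio", audio_path,
            "--outfile", out_path,
        ],
        check=True,
    )
```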

## Deploying to Hugging Face Spaces

1. **Create a free account:** go to [huggingface.co](https://huggingface.co) and create a free account if you don't already have one.
2. **Create a new Space:** from your dashboard, click "New Space", choose the Gradio SDK and give your Space a name (e.g. `lip-sync-app`). Set it to "Public" so that it can use free hardware.
3. **Upload the project files:** clone or download this repository, then upload the contents of the `lipsync_app` folder (`app.py`, `requirements.txt` and this `README.md`) into your new Space. The file structure should look like this:

   ```
   ├── app.py
   ├── requirements.txt
   └── README.md
   ```

4. **(Optional) Upload the Wav2Lip source and weights:** in environments without internet access you should provide two additional assets:
   - A zip of the Wav2Lip repository named `Wav2Lip-master.zip` (download it from https://github.com/Rudrabha/Wav2Lip using the **Download ZIP** button). Place the archive at the root of your Space so the app can extract it.
   - The model checkpoints `wav2lip_gan.pth` (≈436 MB) and `face_segmentation.pth` (≈53 MB). Upload these files into a `Wav2Lip/checkpoints` folder in your Space (a sketch of doing this programmatically follows these steps). You can obtain them from the links in `MODEL_URLS` inside `app.py`.
5. **Commit and build:** once all files are uploaded, commit them to the Space. Hugging Face installs the Python dependencies and builds the application automatically. The first build may take a few minutes.
6. **Use the app:** after the build succeeds and the weights are available, open the **App** tab of your Space. You can now upload an image and an audio file and click **Generate video** to produce a lip‑synchronised output. Longer audio clips (up to ten minutes) take longer to process.
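
If uploading the multi‑hundred‑megabyte checkpoints through the web UI is inconvenient, they can also be pushed programmatically with `huggingface_hub`. In this sketch the Space ID and the local file names are placeholders; substitute your own:

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes you have logged in via `huggingface-cli login`

# "your-username/lip-sync-app" is a placeholder -- use your own Space ID.
for local_file, remote_path in [
    ("wav2lip_gan.pth", "Wav2Lip/checkpoints/wav2lip_gan.pth"),
    ("face_segmentation.pth", "Wav2Lip/checkpoints/face_segmentation.pth"),
]:
    api.upload_file(
        path_or_fileobj=local_file,
        path_in_repo=remote_path,
        repo_id="your-username/lip-sync-app",
        repo_type="space",
    )
```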

## Limitations

- **Large downloads on first run:** the Wav2Lip weights are hundreds of megabytes. The application caches them in the working directory so subsequent runs are faster. If the environment is reset (e.g. the Space times out), the weights are downloaded again.
- **Processing time:** running Wav2Lip on CPU is slow (several minutes for a one‑minute clip). For best performance, enable GPU hardware in your Space settings. GPU hours are free on public Spaces but limited, so plan accordingly.
- **Avatar quality:** Wav2Lip works best with clear, front‑facing images where the mouth is visible. Complex backgrounds or occlusions can degrade the result.

## Troubleshooting

- **Error cloning repository:** network‑restricted environments may forbid git operations. Download the Wav2Lip source code as a zip file on your own machine via the **Download ZIP** option on GitHub, rename the archive to `Wav2Lip-master.zip` and upload it into the root of your Space. The app will extract it automatically.
- **Model download fails:** large binary files often cannot be fetched from within a Space. Download the files listed in `MODEL_URLS` in `app.py` (`wav2lip_gan.pth` and `face_segmentation.pth`) to your computer and upload them into `Wav2Lip/checkpoints` in your Space. Once they are present, the app skips downloading them.
- **Inference error / missing FFmpeg:** the Wav2Lip inference script requires the `ffmpeg` binary to combine audio and video. If your Space does not have `ffmpeg` installed, consider enabling GPU hardware (which comes with `ffmpeg`) or adding a static `ffmpeg` binary to your repository and adjusting `PATH` accordingly (see the snippet after this list).
- **App times out or runs out of memory:** try reducing the audio length or using CPU hardware. Free GPU instances provide limited memory (T4/8 GB), which may not handle extremely high‑resolution inputs.
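
For the bundled‑FFmpeg approach, a snippet like the following near the top of `app.py` makes a committed binary visible to subprocesses. The `bin/ffmpeg` location is an assumption; adjust it to wherever you place the binary:

```python
import os
from pathlib import Path

# Assumes a static ffmpeg binary was committed to ./bin/ffmpeg in the Space
# and marked executable; prepend its directory so subprocesses find it first.
bin_dir = Path(__file__).resolve().parent / "bin"
os.environ["PATH"] = f"{bin_dir}{os.pathsep}{os.environ.get('PATH', '')}"
```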

## Acknowledgements

This project would not be possible without the original Wav2Lip authors and their openly released code and models. It also makes use of Gradio for the web interface.