---
title: MuseTalk
emoji: 💻
colorFrom: indigo
colorTo: green
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
short_description: MuseTalk - Real-time Audio-Driven Lip Sync
---

# MuseTalk: Real-Time High-Quality Lip Synchronization

This Hugging Face Space lets you run MuseTalk for audio-driven lip synchronization experiments.

## About MuseTalk

MuseTalk is a real-time, high-quality, audio-driven lip synchronization model that generates realistic lip movements from audio input. It can be applied to videos to create lip-synced content.

## Features

- **Real-time Processing**: Generate lip-synced videos efficiently
- **High Quality**: Produces natural and realistic lip movements
- **Easy to Use**: Simple Gradio interface for quick experimentation
- **Customizable**: Adjust bounding box positions for better results

## How to Use

1. **Upload Video**: Provide an input video file, preferably with a clear face
2. **Upload Audio**: Provide an audio file containing the target speech
3. **Adjust Parameters** (optional): Fine-tune the `bbox_shift` parameter
4. **Generate**: Click the "Generate" button to create your lip-synced video
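
You can also drive the same workflow programmatically. The snippet below is a hedged sketch using `gradio_client`; the Space ID placeholder, the `api_name`, and the parameter order are assumptions, so check this Space's "Use via API" panel for the actual endpoint signature.

```python
# Hypothetical programmatic call to this Space via gradio_client.
# Space ID, api_name, and parameter order below are assumptions.
from gradio_client import Client, handle_file

client = Client("<username>/<this-space>")  # replace with this Space's actual ID

result = client.predict(
    handle_file("input_video.mp4"),    # video with a clear, well-lit face
    handle_file("target_speech.wav"),  # audio containing the target speech
    0,                                 # bbox_shift (assumed parameter; tune if detection is off-center)
    api_name="/predict",               # assumed endpoint name
)
print(result)  # path to the generated lip-synced video
```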

## Model Information

- **Model Weights**: [TMElyralab/MuseTalk](https://huggingface.co/TMElyralab/MuseTalk)
- **GitHub Repository**: [TMElyralab/MuseTalk](https://github.com/TMElyralab/MuseTalk)

## Requirements

The Space automatically installs all necessary dependencies including:

- PyTorch and Torchvision
- Gradio for the UI
- OpenCV for video processing
- Various ML libraries (transformers, diffusers, etc.)

## Setup Instructions

This Space is configured to:

1. Clone the MuseTalk repository on first run
2. Install all required dependencies from requirements.txt
3. Download necessary model weights automatically
4. Launch the Gradio interface
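
The snippet below is a minimal sketch of how steps 1-2 of this flow might look in Python; the local directory name is an assumption, and the Space's actual `app.py` is the source of truth.

```python
# A hedged sketch of first-run setup: clone the repo, then install its dependencies.
import os
import subprocess
import sys

REPO_URL = "https://github.com/TMElyralab/MuseTalk"
REPO_DIR = "MuseTalk"  # assumed local directory name

if not os.path.isdir(REPO_DIR):
    # Step 1: clone the MuseTalk repository on first run
    subprocess.run(["git", "clone", REPO_URL, REPO_DIR], check=True)
    # Step 2: install dependencies from the project's requirements.txt
    subprocess.run(
        [sys.executable, "-m", "pip", "install", "-r",
         os.path.join(REPO_DIR, "requirements.txt")],
        check=True,
    )
# Steps 3-4 (weight download and Gradio launch) follow; a weight-download sketch
# appears under "Technical Details" below.
```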

## Technical Details

**Required Model Components:**

- VAE: sd-vae-ft-mse from Stability AI
- Whisper: for audio processing
- DWPose: for pose estimation
- Face Parsing: for face segmentation
- ResNet18: for feature extraction
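
As an illustration, the checkpoints hosted on the Hugging Face Hub could be fetched with `huggingface_hub`. This is a hedged sketch, not the Space's actual download code; the local directory layout and the mapping of the remaining components to upstream sources are assumptions.

```python
# Hypothetical weight download using huggingface_hub (directory layout assumed).
from huggingface_hub import snapshot_download

# Core MuseTalk weights and configs
snapshot_download("TMElyralab/MuseTalk", local_dir="models/musetalk")

# VAE used to encode/decode video frames
snapshot_download("stabilityai/sd-vae-ft-mse", local_dir="models/sd-vae-ft-mse")

# Whisper, DWPose, face-parsing, and ResNet18 checkpoints are fetched from their
# respective upstream sources during the same setup step.
```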

## Tips for Best Results

- Use videos with clear, well-lit faces
- Use clean, clear audio for better lip sync
- Adjust the `bbox_shift` parameter if the face detection is off-center
- Input videos should ideally be in MP4 format

## Citation

If you use MuseTalk in your research or projects, please cite the original repository:

```
@misc{musetalk2024,
  title={MuseTalk: Real-Time High-Quality Lip Synchronization},
  author={TMElyralab},
  year={2024},
  url={https://github.com/TMElyralab/MuseTalk}
}
```

## Related Projects

- [MuseV](https://github.com/TMElyralab/MuseV) - For text-to-video generation

## License

Please refer to the [original repository](https://github.com/TMElyralab/MuseTalk) for licensing information.

---
| **Note**: First-time setup may take several minutes as model weights (~2GB) are downloaded automatically. |