---
license: mit
tags:
- lip-sync
- musetalk
- talking-head
- mirror
library_name: pytorch
---
# MuseTalk Mirror (A.I.M.I)
Mirror of [TMElyralab/MuseTalk](https://huggingface.co/TMElyralab/MuseTalk) V1.5 plus its inference-time dependencies, re-hosted for stable URLs inside the [A.I.M.I](https://aimi.app) desktop product. Contents are unmodified.
MuseTalk re-syncs the lips of an existing video to match a new audio track: it edits only the mouth region and passes the rest of the frame through unchanged. It pairs with our TTS + Voice-Clone stack for full "text → lip-synced video" workflows.
## Files
| Folder / File | Upstream | Size | Purpose |
|---|---|---|---|
| `musetalkV15/unet.pth` | TMElyralab/MuseTalk | 3.24 GB | MuseTalk V1.5 UNet weights |
| `musetalkV15/musetalk.json` | TMElyralab/MuseTalk | 748 B | UNet config |
| `sd-vae-ft-mse/diffusion_pytorch_model.bin` | stabilityai/sd-vae-ft-mse | 319 MB | VAE for face latents |
| `sd-vae-ft-mse/config.json` | stabilityai/sd-vae-ft-mse | 547 B | VAE config |
| `whisper/pytorch_model.bin` | openai/whisper-tiny | 144 MB | Audio feature extraction (tiny) |
| `dwpose/dw-ll_ucoco_384.pth` | yzd-v/DWPose | 388 MB | Face bbox + pose detection |
| `face-parse-bisent/79999_iter.pth` | ManyOtherFunctions/face-parse-bisent | 51 MB | BiSeNet face-region parser |
| `face-parse-bisent/resnet18-5c106cde.pth` | pytorch.org/models | 45 MB | ResNet18 backbone for face-parser |
Total: ~4.2 GB (sum of the sizes listed above).
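The layout above can be checked after a download with a small script. This is a minimal sketch, not part of MuseTalk or A.I.M.I: the `MANIFEST` dictionary and the helper names `missing_files` and `expected_total_gb` are hypothetical, the sizes are the approximate values copied from the table, and it assumes the mirror was downloaded into a single local folder with the same relative paths.

```python
from pathlib import Path

# Relative paths and approximate sizes in MB, copied from the Files table.
# The sizes are rounded, so they are only used for a rough total.
MANIFEST = {
    "musetalkV15/unet.pth": 3240.0,
    "musetalkV15/musetalk.json": 0.000748,
    "sd-vae-ft-mse/diffusion_pytorch_model.bin": 319.0,
    "sd-vae-ft-mse/config.json": 0.000547,
    "whisper/pytorch_model.bin": 144.0,
    "dwpose/dw-ll_ucoco_384.pth": 388.0,
    "face-parse-bisent/79999_iter.pth": 51.0,
    "face-parse-bisent/resnet18-5c106cde.pth": 45.0,
}


def missing_files(root):
    """Return the manifest paths that are not regular files under root."""
    root = Path(root)
    return [rel for rel in MANIFEST if not (root / rel).is_file()]


def expected_total_gb():
    """Sum the listed sizes, treating 1 GB as 1000 MB."""
    return sum(MANIFEST.values()) / 1000.0
```

Calling `missing_files("path/to/local/mirror")` returns an empty list when every file from the table is present. The check only looks at file presence, not content, so it does not replace a checksum comparison against upstream.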
## Licenses
| Component | License |
|---|---|
| MuseTalk | MIT (Tencent Music Entertainment Lyra Lab) |
| SD-VAE-ft-MSE | CreativeML Open RAIL-M (Stability AI) |
| Whisper | MIT (OpenAI) |
| DWPose | Apache 2.0 |
| face-parse-bisent | MIT |
| ResNet18 (pretrained) | BSD-3-Clause (PyTorch / Facebook) |
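All components permit commercial use under the terms of their licenses; note that CreativeML Open RAIL-M also carries use-based restrictions. Files are redistributed unchanged. See the upstream repos for the full license texts.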
## Attribution
- **MuseTalk**: Yue Zhang, Minhao Liu, Zhaokang Chen, Bin Wu, Yubin Zeng, Chao Zhan, Yingjie He, Junxin Huang, Wenjiang Zhou, *MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting* (2024).
- **Whisper**: Alec Radford et al., *Robust Speech Recognition via Large-Scale Weak Supervision* (OpenAI, 2022).
- **DWPose**: Zhendong Yang, Ailing Zeng, Chun Yuan, Yu Li, *Effective Whole-body Pose Estimation with Two-stages Distillation* (ICCV 2023).
- **BiSeNet**: Changqian Yu et al., *BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation* (ECCV 2018).