	---
	license: mit
	tags:
	- lip-sync
	- musetalk
	- talking-head
	- mirror
	library_name: pytorch
	---

	# MuseTalk Mirror (A.I.M.I)

	Mirror of [TMElyralab/MuseTalk](https://huggingface.co/TMElyralab/MuseTalk) V1.5 plus its inference-time dependencies, re-hosted for stable URLs inside the [A.I.M.I](https://aimi.app) desktop product. Contents are unmodified.

	MuseTalk re-syncs the lips of an existing video to match a new audio track: it edits only the mouth region and passes the rest of each frame through unchanged. It pairs with our TTS + Voice-Clone stack for full "text → lip-synced video" workflows.
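To illustrate what "mouth-region editing, rest of frame passes through" means, here is a minimal sketch of pasting a regenerated patch back into an otherwise untouched frame. This is not MuseTalk's code: the function name `composite_region`, the list-of-rows frame representation, and the hard rectangular paste are all assumptions made for illustration, and the real pipeline is more involved.

```python
def composite_region(frame, patch, box):
    """Return a copy of `frame` with `patch` pasted into `box`.

    frame: list of rows, each a list of pixel values
    patch: list of rows covering exactly the box area
    box:   (top, left, bottom, right), with bottom/right exclusive

    Hypothetical helper for illustration only, not part of MuseTalk.
    """
    top, left, bottom, right = box
    if len(patch) != bottom - top or any(len(row) != right - left for row in patch):
        raise ValueError("patch size does not match box")
    # Copy every row so the input frame is left unmodified.
    out = [list(row) for row in frame]
    for r in range(top, bottom):
        # Only pixels inside the box change; everything else passes through.
        out[r][left:right] = patch[r - top]
    return out
```

Pixels outside `box` in the returned frame are identical to the input, which is the pass-through behaviour described above.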

	## Files

	| Folder / File | Upstream | Size | Purpose |
	|---|---|---|---|
	| `musetalkV15/unet.pth` | TMElyralab/MuseTalk | 3.24 GB | MuseTalk V1.5 UNet weights |
	| `musetalkV15/musetalk.json` | TMElyralab/MuseTalk | 748 B | UNet config |
	| `sd-vae-ft-mse/diffusion_pytorch_model.bin` | stabilityai/sd-vae-ft-mse | 319 MB | VAE for face latents |
	| `sd-vae-ft-mse/config.json` | stabilityai/sd-vae-ft-mse | 547 B | VAE config |
	| `whisper/pytorch_model.bin` | openai/whisper-tiny | 144 MB | Audio feature extraction (tiny) |
	| `dwpose/dw-ll_ucoco_384.pth` | yzd-v/DWPose | 388 MB | Face bbox + pose detection |
	| `face-parse-bisent/79999_iter.pth` | ManyOtherFunctions/face-parse-bisent | 51 MB | BiSeNet face-region parser |
	| `face-parse-bisent/resnet18-5c106cde.pth` | pytorch.org/models | 45 MB | ResNet18 backbone for face-parser |

	Total: ~4.2 GB.
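After downloading, it can be useful to confirm that the local copy has the layout from the table above. The following is a minimal sketch using only the standard library. The relative paths are taken from the Files table, while the function name `missing_files` and the idea of a single local root directory are assumptions for illustration, not something this mirror ships.

```python
from pathlib import Path

# Relative paths copied from the Files table above.
EXPECTED_FILES = [
    "musetalkV15/unet.pth",
    "musetalkV15/musetalk.json",
    "sd-vae-ft-mse/diffusion_pytorch_model.bin",
    "sd-vae-ft-mse/config.json",
    "whisper/pytorch_model.bin",
    "dwpose/dw-ll_ucoco_384.pth",
    "face-parse-bisent/79999_iter.pth",
    "face-parse-bisent/resnet18-5c106cde.pth",
]


def missing_files(root):
    """Return the expected relative paths that are absent or empty under `root`.

    Hypothetical helper: `root` is assumed to be the local directory that
    mirrors this repository's folder structure.
    """
    root = Path(root)
    missing = []
    for rel in EXPECTED_FILES:
        path = root / rel
        # Treat zero-byte files as missing; they usually mean a failed download.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing
```

Calling `missing_files("models")` (with `models` standing in for wherever the files were saved) returns an empty list when all eight files are present and non-empty. The check covers presence only; it does not validate checksums or file contents.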

	## Licenses

	| Component | License |
	|---|---|
	| MuseTalk | MIT (Tencent Music Entertainment Lyra Lab) |
	| SD-VAE-ft-MSE | CreativeML Open RAIL-M (Stability AI) |
	| Whisper | MIT (OpenAI) |
	| DWPose | Apache 2.0 |
	| face-parse-bisent | MIT |
	| ResNet18 (pretrained) | BSD-3-Clause (PyTorch / Facebook) |

	All components are commercial-use-compatible and are redistributed unchanged. See the upstream repos for the full license texts.

	## Attribution

	- MuseTalk: Yue Zhang, Minhao Liu, Zhaokang Chen, Bin Wu, Yubin Zeng, Chao Zhan, Yingjie He, Junxin Huang, Wenjiang Zhou — MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting (2024).
	- Whisper: Alec Radford et al. — Robust Speech Recognition via Large-Scale Weak Supervision (OpenAI, 2022).
	- DWPose: Zhendong Yang, Ailing Zeng, Chun Yuan, Yu Li — Effective Whole-body Pose Estimation with Two-stages Distillation (ICCV 2023).
	- BiSeNet: Changqian Yu et al. — BiSeNet: Bilateral Segmentation Network for Real-time Semantic Segmentation (ECCV 2018).
