MMAudio Models (FP16 Safetensors)
FP16 .safetensors conversions of MMAudio model checkpoints and dependencies for use with ComfyUI-FFMPEGA.
Models
MMAudio Core
| File | Description | FP16 Size |
|---|---|---|
| mmaudio_large_44k_v2.safetensors | MMAudio large model (44kHz, v2) | 1,966 MB |
| v1-44.safetensors | VAE decoder (44kHz) | 583 MB |
| synchformer_state_dict.safetensors | Synchformer temporal encoder | 453 MB |
Dependencies
| File | Description | Size |
|---|---|---|
| apple_DFN5B-CLIP-ViT-H-14-384_fp16.safetensors | CLIP vision encoder (fp16) | 1,882 MB |
| bigvgan_v2_44khz_128band_512x/ | BigVGAN v2 vocoder (44kHz) | ~467 MB |
All models were converted from their original FP32 sources to FP16 .safetensors, which roughly halves the file size (about a 50% reduction).
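To confirm that a downloaded file really stores FP16 weights, it is enough to read the JSON header at the start of the .safetensors file, without loading any tensors. The following is a minimal sketch using only the Python standard library; the function name `safetensors_dtype_counts` and the example path are illustrative assumptions, not part of ComfyUI-FFMPEGA.

```python
import json
import struct
from collections import Counter
from pathlib import Path


def safetensors_dtype_counts(path):
    """Count tensors per dtype by reading only the safetensors JSON header."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as a little-endian uint64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return Counter(
        entry["dtype"] for name, entry in header.items() if name != "__metadata__"
    )


if __name__ == "__main__":
    # Hypothetical location; adjust to wherever the file was downloaded.
    path = Path("ComfyUI/models/mmaudio/mmaudio_large_44k_v2.safetensors")
    if path.exists():
        print(safetensors_dtype_counts(path))
```

An FP16 conversion should report mostly `F16` entries. A few tensors of other dtypes (for example integer buffers) may still appear, depending on how the conversion handled them.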
Sources
All models were downloaded from their original sources and converted by us:
- MMAudio: hkchengrex/MMAudio
- CLIP: apple/DFN5B-CLIP-ViT-H-14-384
- BigVGAN: nvidia/bigvgan_v2_44khz_128band_512x
Usage
These models are downloaded automatically by the generate_audio skill in ComfyUI-FFMPEGA. No manual setup is needed.
Manual installation:
ComfyUI/models/mmaudio/
├── mmaudio_large_44k_v2.safetensors
├── v1-44.safetensors
├── synchformer_state_dict.safetensors
├── apple_DFN5B-CLIP-ViT-H-14-384_fp16.safetensors
└── bigvgan_v2_44khz_128band_512x/
    ├── config.json
    └── bigvgan_generator.pt
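After a manual installation, a quick check that every file from the layout above is in place can save a failed run later. This is a small sketch, assuming `config.json` and `bigvgan_generator.pt` sit inside the `bigvgan_v2_44khz_128band_512x/` directory and that the script runs from the directory containing `ComfyUI/`. The helper `missing_files` is a hypothetical name, not something provided by ComfyUI-FFMPEGA.

```python
from pathlib import Path

# Relative paths copied from the manual installation layout.
EXPECTED = [
    "mmaudio_large_44k_v2.safetensors",
    "v1-44.safetensors",
    "synchformer_state_dict.safetensors",
    "apple_DFN5B-CLIP-ViT-H-14-384_fp16.safetensors",
    "bigvgan_v2_44khz_128band_512x/config.json",
    "bigvgan_v2_44khz_128band_512x/bigvgan_generator.pt",
]


def missing_files(model_dir):
    """Return the expected relative paths that are not regular files under model_dir."""
    root = Path(model_dir)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


if __name__ == "__main__":
    # Assumed location relative to the current working directory.
    missing = missing_files("ComfyUI/models/mmaudio")
    print("All files present" if not missing else f"Missing: {missing}")
```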
License
⚠️ CC-BY-NC 4.0: MMAudio model checkpoints are licensed under Creative Commons Attribution-NonCommercial 4.0, so commercial use of the models is not permitted under that license. The code that loads and runs them is licensed separately under MIT/GPL-3.0.
BigVGAN is licensed under MIT. CLIP (DFN5B) is licensed under Apple's research license.
Paper
Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis (CVPR 2025)