MMAudio Models (FP16 Safetensors)

FP16 .safetensors conversions of MMAudio model checkpoints and dependencies for use with ComfyUI-FFMPEGA.

Models

MMAudio Core

| File | Description | FP16 Size |
|---|---|---|
| mmaudio_large_44k_v2.safetensors | MMAudio large model (44kHz, v2) | 1,966 MB |
| v1-44.safetensors | VAE decoder (44kHz) | 583 MB |
| synchformer_state_dict.safetensors | Synchformer temporal encoder | 453 MB |

Dependencies

| File | Description | Size |
|---|---|---|
| apple_DFN5B-CLIP-ViT-H-14-384_fp16.safetensors | CLIP vision encoder (fp16) | 1,882 MB |
| bigvgan_v2_44khz_128band_512x/ | BigVGAN v2 vocoder (44kHz) | ~467 MB |

All models were converted from their original FP32 sources to FP16 .safetensors (about a 50% size reduction).
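To plan disk space, the sizes listed in the tables above can be added up. The sketch below does that arithmetic and also shows the FP32 footprint implied by the 50% figure. It assumes every listed file was halved by the conversion and ignores file-format overhead, so treat the results as rough estimates. The helper names are illustrative and not part of ComfyUI-FFMPEGA.

```python
# Sizes in MB, copied from the tables above; the BigVGAN figure is approximate.
SIZES_MB = {
    "mmaudio_large_44k_v2.safetensors": 1966,
    "v1-44.safetensors": 583,
    "synchformer_state_dict.safetensors": 453,
    "apple_DFN5B-CLIP-ViT-H-14-384_fp16.safetensors": 1882,
    "bigvgan_v2_44khz_128band_512x/": 467,
}


def total_mb(sizes):
    """Sum the listed sizes (MB)."""
    return sum(sizes.values())


def estimated_fp32_mb(fp16_mb):
    """FP32 stores 4 bytes per weight and FP16 stores 2, so FP32 is about twice as large."""
    return fp16_mb * 2


total = total_mb(SIZES_MB)
print(f"Total size of listed files: about {total} MB")
print(f"Implied FP32 size (assuming all files halved): about {estimated_fp32_mb(total)} MB")
```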

Sources

All models were downloaded from their original sources and converted by us.

Usage

These models are automatically downloaded by the generate_audio skill in ComfyUI-FFMPEGA. No manual setup needed.

Manual installation:

ComfyUI/models/mmaudio/
├── mmaudio_large_44k_v2.safetensors
├── v1-44.safetensors
├── synchformer_state_dict.safetensors
├── apple_DFN5B-CLIP-ViT-H-14-384_fp16.safetensors
└── bigvgan_v2_44khz_128band_512x/
    ├── config.json
    └── bigvgan_generator.pt
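After a manual install, a quick check that every file sits where the tree above expects it can save a debugging session. The following is a minimal sketch using only the Python standard library. The function name `find_missing` and the assumption that the script runs from the directory containing `ComfyUI/` are illustrative and not part of ComfyUI-FFMPEGA.

```python
from pathlib import Path

# Expected layout, copied from the tree above
# (paths are relative to ComfyUI/models/mmaudio/).
EXPECTED_FILES = [
    "mmaudio_large_44k_v2.safetensors",
    "v1-44.safetensors",
    "synchformer_state_dict.safetensors",
    "apple_DFN5B-CLIP-ViT-H-14-384_fp16.safetensors",
    "bigvgan_v2_44khz_128band_512x/config.json",
    "bigvgan_v2_44khz_128band_512x/bigvgan_generator.pt",
]


def find_missing(model_dir):
    """Return the expected relative paths that are not regular files under model_dir."""
    root = Path(model_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Assumption: the current working directory contains ComfyUI/.
    missing = find_missing("ComfyUI/models/mmaudio")
    if missing:
        print("Missing files:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected files are present.")
```

The check only confirms that the files exist. It does not validate their contents or sizes.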

License

⚠️ CC-BY-NC 4.0: MMAudio model checkpoints are licensed under Creative Commons Attribution-NonCommercial 4.0, which does not permit commercial use of the models. The code that loads/runs them is MIT/GPL-3.0.

BigVGAN is licensed under MIT. CLIP (DFN5B) is licensed under Apple's research license.

Paper

Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis (CVPR 2025)
