MMAudio Models (FP16 Safetensors)

FP16 .safetensors conversions of MMAudio model checkpoints and dependencies for use with ComfyUI-FFMPEGA.

Models

MMAudio Core

File	Description	FP16 Size
`mmaudio_large_44k_v2.safetensors`	MMAudio large model (44kHz, v2)	1,966 MB
`v1-44.safetensors`	VAE decoder (44kHz)	583 MB
`synchformer_state_dict.safetensors`	Synchformer temporal encoder	453 MB

Dependencies

File	Description	Size
`apple_DFN5B-CLIP-ViT-H-14-384_fp16.safetensors`	CLIP vision encoder (fp16)	1,882 MB
`bigvgan_v2_44khz_128band_512x/`	BigVGAN v2 vocoder (44kHz)	~467 MB

All models were converted from their original FP32 sources to FP16 .safetensors, which roughly halves their size on disk.
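To confirm that a downloaded file really stores FP16 weights, the .safetensors header can be read directly without loading any tensors. The following is a minimal sketch using only the Python standard library. It relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header); the function names `read_safetensors_header`, `dtype_summary`, and `payload_bytes` are illustrative and not part of ComfyUI-FFMPEGA or MMAudio.

```python
import json
import struct
from collections import Counter


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def dtype_summary(path):
    """Count tensors per dtype string (for example "F16" or "F32")."""
    header = read_safetensors_header(path)
    return Counter(entry["dtype"] for entry in header.values())


def payload_bytes(path):
    """Total bytes occupied by tensor data, computed from the recorded offsets."""
    header = read_safetensors_header(path)
    return sum(e["data_offsets"][1] - e["data_offsets"][0] for e in header.values())
```

For example, `dtype_summary("ComfyUI/models/mmaudio/mmaudio_large_44k_v2.safetensors")` would be expected to report mostly or only `F16` entries for a converted checkpoint, though any non-float buffers in a model may legitimately carry other dtypes.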

Sources

All models were downloaded from their original sources and converted by us:

MMAudio: hkchengrex/MMAudio
CLIP: apple/DFN5B-CLIP-ViT-H-14-384
BigVGAN: nvidia/bigvgan_v2_44khz_128band_512x

Usage

These models are automatically downloaded by the generate_audio skill in ComfyUI-FFMPEGA. No manual setup needed.

To install manually instead, place the files in this layout:

ComfyUI/models/mmaudio/
├── mmaudio_large_44k_v2.safetensors
├── v1-44.safetensors
├── synchformer_state_dict.safetensors
├── apple_DFN5B-CLIP-ViT-H-14-384_fp16.safetensors
└── bigvgan_v2_44khz_128band_512x/
    ├── config.json
    └── bigvgan_generator.pt
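
After a manual installation, a quick check that every file from the layout above is in place can save a failed first run. This is a small sketch, assuming it is run from the directory that contains `ComfyUI/`; `EXPECTED_FILES` and `missing_model_files` are hypothetical names introduced here, not something ComfyUI-FFMPEGA provides.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED_FILES = [
    "mmaudio_large_44k_v2.safetensors",
    "v1-44.safetensors",
    "synchformer_state_dict.safetensors",
    "apple_DFN5B-CLIP-ViT-H-14-384_fp16.safetensors",
    "bigvgan_v2_44khz_128band_512x/config.json",
    "bigvgan_v2_44khz_128band_512x/bigvgan_generator.pt",
]


def missing_model_files(model_dir):
    """Return the expected relative paths that are not present under model_dir."""
    root = Path(model_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Assumes the current working directory contains ComfyUI/.
    missing = missing_model_files("ComfyUI/models/mmaudio")
    print("all files present" if not missing else f"missing: {missing}")
```

The check only looks at file presence, not at file size or integrity, so a truncated download would still pass.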

License

⚠️ CC-BY-NC 4.0: MMAudio model checkpoints are licensed under Creative Commons Attribution-NonCommercial 4.0, so the models may not be used commercially. The code that loads and runs them is licensed separately under MIT/GPL-3.0.

BigVGAN is licensed under MIT. CLIP (DFN5B) is licensed under Apple's research license.

Paper

Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis (CVPR 2025)
