AEmotionStudio committed
Commit 276c458 · verified · 1 parent: d2de41b

Upload README.md with huggingface_hub

Files changed (1):
  README.md (+36 −14)
README.md CHANGED
@@ -9,17 +9,34 @@ tags:
 
 # MMAudio Models (FP16 Safetensors)
 
-FP16 `.safetensors` conversions of [MMAudio](https://github.com/hkchengrex/MMAudio) model checkpoints for use with [ComfyUI-FFMPEGA](https://github.com/AEmotionStudio/ComfyUI-FFMPEGA).
+FP16 `.safetensors` conversions of [MMAudio](https://github.com/hkchengrex/MMAudio) model checkpoints and dependencies for use with [ComfyUI-FFMPEGA](https://github.com/AEmotionStudio/ComfyUI-FFMPEGA).
 
 ## Models
 
-| File | Description | FP16 Size | Original Size |
-|------|-------------|-----------|---------------|
-| `mmaudio_large_44k_v2.safetensors` | MMAudio large model (44kHz, v2) | 1,966 MB | 3,932 MB |
-| `v1-44.safetensors` | VAE decoder (44kHz) | 583 MB | 1,165 MB |
-| `synchformer_state_dict.safetensors` | Synchformer temporal encoder | 453 MB | 906 MB |
-
-All models have been converted from FP32 `.pth` → **FP16** `.safetensors` (50% size reduction).
+### MMAudio Core
+
+| File | Description | FP16 Size |
+|------|-------------|-----------|
+| `mmaudio_large_44k_v2.safetensors` | MMAudio large model (44kHz, v2) | 1,966 MB |
+| `v1-44.safetensors` | VAE decoder (44kHz) | 583 MB |
+| `synchformer_state_dict.safetensors` | Synchformer temporal encoder | 453 MB |
+
+### Dependencies
+
+| File | Description | Size |
+|------|-------------|------|
+| `apple_DFN5B-CLIP-ViT-H-14-384_fp16.safetensors` | CLIP vision encoder (fp16) | 1,882 MB |
+| `bigvgan_v2_44khz_128band_512x/` | BigVGAN v2 vocoder (44kHz) | ~467 MB |
+
+All models converted from original FP32 sources → **FP16** `.safetensors` (50% size reduction).
+
+## Sources
+
+All models were downloaded from their **original sources** and converted by us:
+
+- **MMAudio**: [hkchengrex/MMAudio](https://huggingface.co/hkchengrex/MMAudio)
+- **CLIP**: [apple/DFN5B-CLIP-ViT-H-14-384](https://huggingface.co/apple/DFN5B-CLIP-ViT-H-14-384)
+- **BigVGAN**: [nvidia/bigvgan_v2_44khz_128band_512x](https://huggingface.co/nvidia/bigvgan_v2_44khz_128band_512x)
 
 ## Usage
 
@@ -28,17 +45,22 @@ These models are automatically downloaded by the `generate_audio` skill in Comfy
 **Manual installation:**
 ```
 ComfyUI/models/mmaudio/
-├── mmaudio_large_44k_v2.pth (converted from .safetensors on download)
-├── v1-44.pth
-└── synchformer_state_dict.pth
+├── mmaudio_large_44k_v2.safetensors
+├── v1-44.safetensors
+├── synchformer_state_dict.safetensors
+├── apple_DFN5B-CLIP-ViT-H-14-384_fp16.safetensors
+└── bigvgan_v2_44khz_128band_512x/
+    ├── config.json
+    └── bigvgan_generator.pt
 ```
 
 ## License
 
-> ⚠️ **CC-BY-NC 4.0** — These model checkpoints are licensed under [Creative Commons Attribution-NonCommercial 4.0](https://creativecommons.org/licenses/by-nc/4.0/).
+> ⚠️ **CC-BY-NC 4.0** — MMAudio model checkpoints are licensed under [Creative Commons Attribution-NonCommercial 4.0](https://creativecommons.org/licenses/by-nc/4.0/).
 > Commercial use of the models is restricted. The code that loads/runs them is MIT/GPL-3.0.
+>
+> BigVGAN is licensed under MIT. CLIP (DFN5B) is licensed under Apple's research license.
 
-## Source
+## Paper
 
-Original models: [hkchengrex/MMAudio](https://huggingface.co/hkchengrex/MMAudio)
-Paper: *Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis* (CVPR 2025)
+*Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis* (CVPR 2025)