---
license: cc-by-nc-4.0
tags:
- audiox
- audio-generation
- music-generation
- text-to-audio
- video-to-audio
- audio-inpainting
- safetensors
base_model:
- HKUSTAudio/AudioX
- HKUSTAudio/AudioX-MAF
pipeline_tag: text-to-audio
---

# AudioX Models (Safetensors)

`.safetensors` conversions of the [AudioX-MAF](https://huggingface.co/HKUSTAudio/AudioX-MAF) model checkpoints for use with [ComfyUI-FFMPEGA](https://github.com/AEmotionStudio/ComfyUI-FFMPEGA).

AudioX is a unified anything-to-audio model, presented at ICLR 2026, that supports text-to-audio, text-to-music, video-to-audio, and audio inpainting.

## Models

| File | Description | Size |
|------|-------------|------|
| `model.safetensors` | AudioX-MAF DiT model (full precision) | 5.19 GB |
| `synchformer_state_dict.safetensors` | Synchformer temporal encoder (shared with MMAudio) | 475 MB |
| `config.json` | Model architecture configuration | 3.3 KB |

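A downloaded `.safetensors` file can be inspected without loading several gigabytes of weights, because the format starts with an 8-byte little-endian header length followed by a JSON header that describes every tensor. The following is a minimal sketch using only the Python standard library; the function names `read_safetensors_header` and `summarize` are hypothetical helpers for illustration and are not part of ComfyUI-FFMPEGA or AudioX.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with Path(path).open("rb") as f:
        # First 8 bytes: header length as an unsigned little-endian 64-bit integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        # Next header_len bytes: UTF-8 JSON with dtype, shape, and offsets per tensor.
        return json.loads(f.read(header_len).decode("utf-8"))


def summarize(header):
    """Return (tensor count, parameter count), skipping the optional metadata entry."""
    tensors = {k: v for k, v in header.items() if k != "__metadata__"}
    n_params = 0
    for info in tensors.values():
        count = 1
        for dim in info["shape"]:
            count *= dim
        n_params += count
    return len(tensors), n_params
```

Calling `summarize(read_safetensors_header("model.safetensors"))` reads only the header, so it should return quickly even for the full-precision DiT checkpoint.
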
## Sources

All models were downloaded from their **original sources** and converted by us:

- **AudioX-MAF**: [HKUSTAudio/AudioX-MAF](https://huggingface.co/HKUSTAudio/AudioX-MAF)
- **Synchformer**: Shared with [MMAudio](https://huggingface.co/hkchengrex/MMAudio)

## Usage

The `generate_music` and `audio_inpaint` skills in ComfyUI-FFMPEGA download these models automatically, so no manual setup is needed.

**Manual installation:** place the files in `ComfyUI/models/audiox/`:
```
ComfyUI/models/audiox/
├── model.safetensors
├── synchformer_state_dict.safetensors
└── config.json
```

> **Note:** `synchformer_state_dict.safetensors` is shared with MMAudio. If you already have it in `ComfyUI/models/mmaudio/`, AudioX reuses it automatically, so no duplicate download is needed.

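To confirm that a manual installation is complete, the layout and the Synchformer reuse rule described in this section can be checked with a short script. This is a sketch under stated assumptions, not the lookup code used by ComfyUI-FFMPEGA: the function name `locate_audiox_files` is hypothetical, and checking `audiox/` before `mmaudio/` is an assumed order.

```python
from pathlib import Path

REQUIRED = ("model.safetensors", "synchformer_state_dict.safetensors", "config.json")
SHARED = "synchformer_state_dict.safetensors"


def locate_audiox_files(models_dir):
    """Map each required file name to its path, or None if it is missing."""
    models_dir = Path(models_dir)
    found = {}
    for name in REQUIRED:
        candidates = [models_dir / "audiox" / name]
        if name == SHARED:
            # Assumed fallback: reuse the copy already installed for MMAudio.
            candidates.append(models_dir / "mmaudio" / name)
        found[name] = next((p for p in candidates if p.is_file()), None)
    return found
```

For example, `locate_audiox_files("ComfyUI/models")` returns a dictionary in which any value of `None` marks a file that still needs to be downloaded.
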
## Capabilities

| Skill | Description |
|-------|-------------|
| `generate_music` | Text-to-music and video-to-music generation |
| `audio_inpaint` | Fill gaps, extend, or regenerate sections of audio |

## License

> ⚠️ **CC-BY-NC 4.0**: AudioX model weights are licensed under [Creative Commons Attribution-NonCommercial 4.0](https://creativecommons.org/licenses/by-nc/4.0/).
> Commercial use of the models is not permitted under this license. The code that loads and runs them is licensed under GPL-3.0.

## Paper

*AudioX: Diffusion Transformer for Anything-to-Audio Generation* (ICLR 2026)
[arXiv:2503.10522](https://arxiv.org/abs/2503.10522)