---
license: cc-by-nc-4.0
tags:
- audiox
- audio-generation
- music-generation
- text-to-audio
- video-to-audio
- audio-inpainting
- safetensors
base_model:
- HKUSTAudio/AudioX
- HKUSTAudio/AudioX-MAF
pipeline_tag: text-to-audio
---
# AudioX Models (Safetensors)
`.safetensors` conversions of [AudioX-MAF](https://huggingface.co/HKUSTAudio/AudioX-MAF) model checkpoints for use with [ComfyUI-FFMPEGA](https://github.com/AEmotionStudio/ComfyUI-FFMPEGA).
AudioX (ICLR 2026) is a unified anything-to-audio model that supports text-to-audio, text-to-music, video-to-audio, and audio inpainting.
## Models
| File | Description | Size |
|------|-------------|------|
| `model.safetensors` | AudioX-MAF DiT model (full precision) | 5.19 GB |
| `synchformer_state_dict.safetensors` | Synchformer temporal encoder (shared with MMAudio) | 475 MB |
| `config.json` | Model architecture configuration | 3.3 KB |
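A downloaded `.safetensors` file can be sanity-checked without loading several gigabytes of weights, because the format begins with an 8-byte little-endian header length followed by a JSON header that describes every tensor. The sketch below reads only that header using the Python standard library. The function names `read_safetensors_header` and `summarize` are illustrative and are not part of AudioX or ComfyUI-FFMPEGA.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with Path(path).open("rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        # Next header_len bytes: UTF-8 JSON mapping tensor names to dtype, shape, offsets.
        return json.loads(f.read(header_len).decode("utf-8"))


def summarize(header):
    """Return (number of tensors, total element count) described by a header."""
    # The optional "__metadata__" entry holds free-form strings, not a tensor.
    tensors = {k: v for k, v in header.items() if k != "__metadata__"}
    total = 0
    for info in tensors.values():
        count = 1
        for dim in info["shape"]:
            count *= dim
        total += count
    return len(tensors), total
```

For example, `summarize(read_safetensors_header("ComfyUI/models/audiox/model.safetensors"))` would report how many tensors and parameters the file declares, assuming the file sits at that path.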
## Sources
All models were downloaded from their **original sources** and converted by us:
- **AudioX-MAF**: [HKUSTAudio/AudioX-MAF](https://huggingface.co/HKUSTAudio/AudioX-MAF)
- **Synchformer**: Shared with [MMAudio](https://huggingface.co/hkchengrex/MMAudio)
## Usage
The `generate_music` and `audio_inpaint` skills in ComfyUI-FFMPEGA download these models automatically, so no manual setup is needed.
**Manual installation:** place the three files in `ComfyUI/models/audiox/`:
```
ComfyUI/models/audiox/
├── model.safetensors
├── synchformer_state_dict.safetensors
└── config.json
```
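After a manual install, the layout above can be verified with a short script. This is a minimal sketch that assumes the ComfyUI root directory is passed in by the caller; `missing_audiox_files` is a hypothetical helper, not a function provided by ComfyUI-FFMPEGA.

```python
from pathlib import Path

# File names taken from the directory layout above.
EXPECTED_FILES = (
    "model.safetensors",
    "synchformer_state_dict.safetensors",
    "config.json",
)


def missing_audiox_files(comfyui_root):
    """Return the expected files that are absent from <root>/models/audiox/."""
    audiox_dir = Path(comfyui_root) / "models" / "audiox"
    return [name for name in EXPECTED_FILES if not (audiox_dir / name).is_file()]
```

An empty list means all three files are in place. Note that this check looks only in `models/audiox/` and does not account for a Synchformer checkpoint stored elsewhere.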
> **Note:** The `synchformer_state_dict.safetensors` file is shared with MMAudio. If you already have it in `ComfyUI/models/mmaudio/`, AudioX reuses it automatically, so no duplicate download is needed.
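The reuse behaviour described in the note can be pictured as a lookup across two folders. The sketch below is an illustration only: the search order (`audiox` first, then `mmaudio`) and the helper name `find_synchformer` are assumptions, and the actual ComfyUI-FFMPEGA logic may differ.

```python
from pathlib import Path

SYNCHFORMER = "synchformer_state_dict.safetensors"


def find_synchformer(comfyui_root):
    """Return the first existing Synchformer checkpoint path, or None.

    Assumed search order: models/audiox/ first, then models/mmaudio/.
    """
    models = Path(comfyui_root) / "models"
    for folder in ("audiox", "mmaudio"):
        candidate = models / folder / SYNCHFORMER
        if candidate.is_file():
            return candidate
    return None
```

A `None` result would correspond to the case where the file has to be downloaded.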
## Capabilities
| Skill | Description |
|-------|-------------|
| `generate_music` | Text-to-music and video-to-music generation |
| `audio_inpaint` | Fill gaps, extend, or regenerate sections of audio |
## License
> ⚠️ **CC-BY-NC 4.0**: AudioX model weights are licensed under [Creative Commons Attribution-NonCommercial 4.0](https://creativecommons.org/licenses/by-nc/4.0/).
> Commercial use of the model weights is not permitted under this license. The code that loads and runs them is licensed under GPL-3.0.
## Paper
*AudioX: Diffusion Transformer for Anything-to-Audio Generation* (ICLR 2026)
[arXiv:2503.10522](https://arxiv.org/abs/2503.10522)