---
license: cc-by-nc-4.0
tags:
- audiox
- audio-generation
- music-generation
- text-to-audio
- video-to-audio
- audio-inpainting
- safetensors
base_model:
- HKUSTAudio/AudioX
- HKUSTAudio/AudioX-MAF
pipeline_tag: text-to-audio
---

# AudioX Models (Safetensors)

`.safetensors` conversions of [AudioX-MAF](https://huggingface.co/HKUSTAudio/AudioX-MAF) model checkpoints for use with [ComfyUI-FFMPEGA](https://github.com/AEmotionStudio/ComfyUI-FFMPEGA).

AudioX is a unified anything-to-audio model (ICLR 2026) that supports text-to-audio, text-to-music, video-to-audio, and audio inpainting.

## Models

| File | Description | Size |
|------|-------------|------|
| `model.safetensors` | AudioX-MAF DiT model (full precision) | 5.19 GB |
| `synchformer_state_dict.safetensors` | Synchformer temporal encoder (shared with MMAudio) | 475 MB |
| `config.json` | Model architecture configuration | 3.3 KB |
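To sanity-check a downloaded `.safetensors` file without loading several gigabytes of weights, you can read only its header. The sketch below relies on the published safetensors layout (an 8-byte little-endian length followed by a JSON header); the function names `read_safetensors_header` and `summarize` are illustrative and are not part of ComfyUI-FFMPEGA or AudioX.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensors."""
    with Path(path).open("rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        # Next header_len bytes: UTF-8 JSON mapping tensor names to metadata.
        return json.loads(f.read(header_len).decode("utf-8"))


def summarize(header):
    """Count tensors and parameters, skipping the optional __metadata__ entry."""
    tensors = {k: v for k, v in header.items() if k != "__metadata__"}
    n_params = 0
    for info in tensors.values():
        count = 1
        for dim in info["shape"]:
            count *= dim
        n_params += count
    return len(tensors), n_params
```

Calling `summarize(read_safetensors_header("model.safetensors"))` returns the tensor count and total parameter count, which is a quick way to confirm a download is not truncated at the header level.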

## Sources

All models were downloaded from their **original sources** and converted by us:

- **AudioX-MAF**: [HKUSTAudio/AudioX-MAF](https://huggingface.co/HKUSTAudio/AudioX-MAF)
- **Synchformer**: Shared with [MMAudio](https://huggingface.co/hkchengrex/MMAudio)

## Usage

The `generate_music` and `audio_inpaint` skills in ComfyUI-FFMPEGA download these models automatically, so no manual setup is needed.

**Manual installation:** place the files in the following layout:
```
ComfyUI/models/audiox/
├── model.safetensors
├── synchformer_state_dict.safetensors
└── config.json
```

> **Note:** `synchformer_state_dict.safetensors` is shared with MMAudio. If you already have it in `ComfyUI/models/mmaudio/`, AudioX will reuse it automatically, so no duplicate download is needed.
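After a manual installation, a small script can confirm that the files are where the layout above expects them, including the MMAudio fallback for the Synchformer weights. This is a minimal sketch and not the lookup code ComfyUI-FFMPEGA actually uses; `locate_audiox_files`, `REQUIRED`, and `SHARED` are hypothetical names, and the ComfyUI root path is whatever your installation uses.

```python
from pathlib import Path

# File names taken from the Models table.
REQUIRED = ("model.safetensors", "synchformer_state_dict.safetensors", "config.json")
SHARED = "synchformer_state_dict.safetensors"


def locate_audiox_files(comfy_root):
    """Map each expected file name to its path, or None if it is missing."""
    models = Path(comfy_root) / "models"
    found = {}
    for name in REQUIRED:
        candidates = [models / "audiox" / name]
        if name == SHARED:
            # Fallback described in the note: an existing MMAudio copy.
            candidates.append(models / "mmaudio" / name)
        found[name] = next((p for p in candidates if p.is_file()), None)
    return found
```

Any entry that comes back as `None` is a file that still needs to be downloaded or moved into place.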

## Capabilities

| Skill | Description |
|-------|-------------|
| `generate_music` | Text-to-music and video-to-music generation |
| `audio_inpaint` | Fill gaps, extend, or regenerate sections of audio |

## License

> ⚠️ **CC-BY-NC 4.0**: AudioX model weights are licensed under [Creative Commons Attribution-NonCommercial 4.0](https://creativecommons.org/licenses/by-nc/4.0/).
> Commercial use of the model weights is not permitted under this license. The code that loads and runs them is licensed separately under GPL-3.0.

## Paper

*AudioX: Diffusion Transformer for Anything-to-Audio Generation* (ICLR 2026)
[arXiv:2503.10522](https://arxiv.org/abs/2503.10522)