Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,67 @@
---
license: cc-by-nc-4.0
tags:
- audiox
- audio-generation
- music-generation
- text-to-audio
- video-to-audio
- audio-inpainting
- safetensors
base_model:
- HKUSTAudio/AudioX
- HKUSTAudio/AudioX-MAF
pipeline_tag: text-to-audio
---

# AudioX Models (Safetensors)

This repository contains `.safetensors` conversions of the [AudioX-MAF](https://huggingface.co/HKUSTAudio/AudioX-MAF) model checkpoints for use with [ComfyUI-FFMPEGA](https://github.com/AEmotionStudio/ComfyUI-FFMPEGA).

AudioX is a unified anything-to-audio model (ICLR 2026) that supports text-to-audio, text-to-music, video-to-audio, and audio inpainting.

## Models

| File | Description | Size |
|------|-------------|------|
| `model.safetensors` | AudioX-MAF DiT model (full precision) | 5.19 GB |
| `synchformer_state_dict.safetensors` | Synchformer temporal encoder (shared with MMAudio) | 475 MB |
| `config.json` | Model architecture configuration | 3.3 KB |

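To check a downloaded `.safetensors` file without loading several gigabytes of weights, you can read only its header. The sketch below relies on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header). The function names `read_safetensors_header` and `summarize` are illustrative and are not part of ComfyUI-FFMPEGA or the `safetensors` library.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with Path(path).open("rb") as f:
        # The first 8 bytes hold the header size as an unsigned little-endian 64-bit integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        # The next header_len bytes are UTF-8 JSON describing every tensor.
        return json.loads(f.read(header_len).decode("utf-8"))


def summarize(header):
    """Map tensor name -> (dtype, shape), skipping the optional __metadata__ entry."""
    return {
        name: (info["dtype"], tuple(info["shape"]))
        for name, info in header.items()
        if name != "__metadata__"
    }
```

Calling `summarize(read_safetensors_header("ComfyUI/models/audiox/model.safetensors"))` would list each tensor's dtype and shape. The actual tensor names depend on the checkpoint and are not documented here.
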
## Sources

All models were downloaded from their **original sources** and converted by us:

- **AudioX-MAF**: [HKUSTAudio/AudioX-MAF](https://huggingface.co/HKUSTAudio/AudioX-MAF)
- **Synchformer**: Shared with [MMAudio](https://huggingface.co/hkchengrex/MMAudio)

## Usage

The `generate_music` and `audio_inpaint` skills in ComfyUI-FFMPEGA download these models automatically, so no manual setup is needed.

**Manual installation:**
```
ComfyUI/models/audiox/
├── model.safetensors
├── synchformer_state_dict.safetensors
└── config.json
```

> **Note:** `synchformer_state_dict.safetensors` is shared with MMAudio. If you already have it in `ComfyUI/models/mmaudio/`, AudioX reuses it automatically, so no duplicate download is needed.

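For a manual installation, a quick check of the directory layout can catch a misplaced file before ComfyUI starts. The following is a minimal sketch, not the plugin's actual loader logic: the helper `missing_audiox_files` is hypothetical, and the fallback to `models/mmaudio/` is an assumption that mirrors the note about reusing the MMAudio copy of the Synchformer weights.

```python
from pathlib import Path

# File names are taken from the layout above; the helper itself is hypothetical.
AUDIOX_FILES = (
    "model.safetensors",
    "synchformer_state_dict.safetensors",
    "config.json",
)
SHARED_FILE = "synchformer_state_dict.safetensors"


def missing_audiox_files(comfy_root):
    """Return the expected file names that cannot be found under comfy_root/models."""
    models = Path(comfy_root) / "models"
    missing = []
    for name in AUDIOX_FILES:
        candidates = [models / "audiox" / name]
        if name == SHARED_FILE:
            # Assumed fallback: an existing MMAudio copy also satisfies the requirement.
            candidates.append(models / "mmaudio" / name)
        if not any(p.is_file() for p in candidates):
            missing.append(name)
    return missing
```

An empty list from `missing_audiox_files("ComfyUI")` means all three files are in one of the locations described above.
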
## Capabilities

| Skill | Description |
|-------|-------------|
| `generate_music` | Text-to-music and video-to-music generation |
| `audio_inpaint` | Fill gaps, extend, or regenerate sections of audio |

## License

> ⚠️ **CC-BY-NC 4.0**: AudioX model weights are licensed under [Creative Commons Attribution-NonCommercial 4.0](https://creativecommons.org/licenses/by-nc/4.0/).
> This license does not permit commercial use of the models. The code that loads and runs them is licensed under GPL-3.0.

## Paper

*AudioX: Diffusion Transformer for Anything-to-Audio Generation* (ICLR 2026)
[arXiv:2503.10522](https://arxiv.org/abs/2503.10522)