license: cc-by-nc-4.0
tags:
  - audiox
  - audio-generation
  - music-generation
  - text-to-audio
  - video-to-audio
  - audio-inpainting
  - safetensors
base_model:
  - HKUSTAudio/AudioX
  - HKUSTAudio/AudioX-MAF
pipeline_tag: text-to-audio

AudioX Models (Safetensors)

.safetensors conversions of AudioX-MAF model checkpoints for use with ComfyUI-FFMPEGA.

AudioX (ICLR 2026) is a unified anything-to-audio model that supports text-to-audio, text-to-music, video-to-audio, and audio inpainting.

Models

| File | Description | Size |
| --- | --- | --- |
| model.safetensors | AudioX-MAF DiT model (full precision) | 5.19 GB |
| synchformer_state_dict.safetensors | Synchformer temporal encoder (shared with MMAudio) | 475 MB |
| config.json | Model architecture configuration | 3.3 KB |
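To check a downloaded .safetensors file without loading several gigabytes of weights, you can read just its header. The sketch below relies only on the published safetensors layout (an 8-byte little-endian length followed by a JSON header); the function names `read_safetensors_header` and `summarize` are illustrative and are not part of AudioX or ComfyUI-FFMPEGA.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensor data."""
    with Path(path).open("rb") as f:
        # The first 8 bytes hold the header length as a little-endian uint64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def summarize(header):
    """Return (number of tensors, total element count) for a parsed header."""
    total = 0
    for info in header.values():
        count = 1
        for dim in info["shape"]:
            count *= dim
        total += count
    return len(header), total
```

For example, `summarize(read_safetensors_header("model.safetensors"))` returns the tensor count and total parameter count, which can be a quick sanity check that a download was not truncated.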

Sources

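All models were downloaded from their original sources and converted by us. The upstream repositories listed under base_model above are HKUSTAudio/AudioX and HKUSTAudio/AudioX-MAF.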

Usage

These models are downloaded automatically by the generate_music and audio_inpaint skills in ComfyUI-FFMPEGA, so no manual setup is needed.

For manual installation, place the files as follows:

ComfyUI/models/audiox/
├── model.safetensors
├── synchformer_state_dict.safetensors
└── config.json
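After a manual installation, a short script can confirm that the layout above is complete. This is a minimal sketch: `missing_audiox_files` is a hypothetical helper, and it checks only the audiox folder, so it reports the Synchformer file as missing even when a copy exists elsewhere.

```python
from pathlib import Path

# File names taken from the directory layout above.
EXPECTED_FILES = (
    "model.safetensors",
    "synchformer_state_dict.safetensors",
    "config.json",
)


def missing_audiox_files(comfyui_root):
    """List expected files that are absent from <comfyui_root>/models/audiox/."""
    audiox_dir = Path(comfyui_root) / "models" / "audiox"
    return [name for name in EXPECTED_FILES if not (audiox_dir / name).is_file()]
```

Calling `missing_audiox_files("ComfyUI")` from the directory that contains your ComfyUI checkout returns an empty list when all three files are in place.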

Note: synchformer_state_dict.safetensors is shared with MMAudio. If you already have it in ComfyUI/models/mmaudio/, AudioX reuses it automatically, so no duplicate download is needed.
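The reuse behaviour can be pictured as a simple two-folder lookup. The sketch below is an assumption about how such a lookup might work, not the actual ComfyUI-FFMPEGA code: `find_synchformer` is a hypothetical name, and the search order (audiox first, then mmaudio) is a guess.

```python
from pathlib import Path

SYNCHFORMER = "synchformer_state_dict.safetensors"


def find_synchformer(comfyui_root):
    """Return the first existing copy of the Synchformer weights, or None."""
    models = Path(comfyui_root) / "models"
    # Search order is an assumption; the README only says the MMAudio copy is reused.
    for folder in ("audiox", "mmaudio"):
        candidate = models / folder / SYNCHFORMER
        if candidate.is_file():
            return candidate
    return None
```

A return value of `None` would mean neither folder holds the file, in which case it still needs to be downloaded.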

Capabilities

| Skill | Description |
| --- | --- |
| generate_music | Text-to-music and video-to-music generation |
| audio_inpaint | Fill gaps, extend, or regenerate sections of audio |

License

⚠️ CC-BY-NC 4.0: AudioX model weights are licensed under Creative Commons Attribution-NonCommercial 4.0, which does not permit commercial use. The code that loads and runs them is licensed under GPL-3.0.

Paper

AudioX: Diffusion Transformer for Anything-to-Audio Generation (ICLR 2026), arXiv:2503.10522