audiox-models / README.md

AEmotionStudio

license: cc-by-nc-4.0
tags:
  - audiox
  - audio-generation
  - music-generation
  - text-to-audio
  - video-to-audio
  - audio-inpainting
  - safetensors
base_model:
  - HKUSTAudio/AudioX
  - HKUSTAudio/AudioX-MAF
pipeline_tag: text-to-audio

AudioX Models (Safetensors)

.safetensors conversions of AudioX-MAF model checkpoints for use with ComfyUI-FFMPEGA.

AudioX is a unified anything-to-audio model, presented in an ICLR 2026 paper, that supports text-to-audio, text-to-music, video-to-audio, and audio inpainting.

Models

| File | Description | Size |
|------|-------------|------|
| `model.safetensors` | AudioX-MAF DiT model (full precision) | 5.19 GB |
| `synchformer_state_dict.safetensors` | Synchformer temporal encoder (shared with MMAudio) | 475 MB |
| `config.json` | Model architecture configuration | 3.3 KB |
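To check a downloaded file without loading several gigabytes of weights, you can read only its header. This is a minimal sketch, not part of ComfyUI-FFMPEGA. It relies on the published safetensors layout: an 8-byte little-endian header length, followed by a JSON header that maps tensor names to dtype, shape, and offsets. The function names `read_safetensors_header` and `summarize` are hypothetical helpers introduced for illustration.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as a little-endian uint64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        # The next header_len bytes are a UTF-8 JSON object.
        return json.loads(f.read(header_len).decode("utf-8"))


def summarize(header):
    """Return (tensor count, total element count), ignoring the optional metadata entry."""
    tensors = {k: v for k, v in header.items() if k != "__metadata__"}
    total = 0
    for info in tensors.values():
        count = 1
        for dim in info["shape"]:
            count *= dim
        total += count
    return len(tensors), total
```

For example, `summarize(read_safetensors_header("ComfyUI/models/audiox/model.safetensors"))` would report how many tensors the file declares and how many elements they contain in total, assuming the file sits at that path.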

Sources

All models were downloaded from their original sources and converted to `.safetensors` by us:

AudioX-MAF: HKUSTAudio/AudioX-MAF
Synchformer: Shared with MMAudio

Usage

These models are downloaded automatically by the `generate_music` and `audio_inpaint` skills in ComfyUI-FFMPEGA, so no manual setup is needed.

Manual installation:

ComfyUI/models/audiox/
├── model.safetensors
├── synchformer_state_dict.safetensors
└── config.json
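
If you prefer to script the manual installation, the layout above could be filled in roughly as follows. This is a sketch under assumptions: the repository id `AEmotionStudio/audiox-models` is a guess that you should verify against this page's URL, and `missing_files` and `download_missing` are hypothetical helper names. The download uses `hf_hub_download` from the third-party `huggingface_hub` package, imported lazily so the file check works without it.

```python
from pathlib import Path

# Assumption: verify this repository id before use.
REPO_ID = "AEmotionStudio/audiox-models"
FILES = ["model.safetensors", "synchformer_state_dict.safetensors", "config.json"]


def missing_files(target_dir):
    """Return the expected file names that are not yet present in target_dir."""
    target = Path(target_dir)
    return [name for name in FILES if not (target / name).is_file()]


def download_missing(comfyui_root):
    """Download any missing files into <comfyui_root>/models/audiox/."""
    from huggingface_hub import hf_hub_download  # third-party dependency

    target = Path(comfyui_root) / "models" / "audiox"
    target.mkdir(parents=True, exist_ok=True)
    for name in missing_files(target):
        # local_dir places the file directly at target/<name>.
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=target)
    return target
```

Calling `download_missing("ComfyUI")` would fetch only the files that are absent, so rerunning it after a partial download does not start over.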

Note: The `synchformer_state_dict.safetensors` file is shared with MMAudio. If you already have it in ComfyUI/models/mmaudio/, AudioX reuses it automatically, so no duplicate download is needed.
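
The reuse rule in the note can be pictured as a simple path lookup. The sketch below is illustrative only and is not the actual ComfyUI-FFMPEGA code. The helper name `resolve_synchformer` and the search order (`audiox` first, then `mmaudio`) are assumptions.

```python
from pathlib import Path

SYNCHFORMER = "synchformer_state_dict.safetensors"


def resolve_synchformer(models_dir):
    """Return the first existing Synchformer checkpoint path, or None if absent."""
    models = Path(models_dir)
    # Assumed search order: the AudioX folder first, then the MMAudio folder.
    for folder in ("audiox", "mmaudio"):
        candidate = models / folder / SYNCHFORMER
        if candidate.is_file():
            return candidate
    return None
```

A return value of `None` means the file is in neither folder and would still need to be downloaded.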

Capabilities

| Skill | Description |
|-------|-------------|
| `generate_music` | Text-to-music and video-to-music generation |
| `audio_inpaint` | Fill gaps, extend, or regenerate sections of audio |

License

⚠️ CC-BY-NC 4.0: AudioX model weights are licensed under Creative Commons Attribution-NonCommercial 4.0, so commercial use of the models is not permitted under this license. The code that loads and runs them is GPL-3.0.

Paper

AudioX: Diffusion Transformer for Anything-to-Audio Generation (ICLR 2026), arXiv:2503.10522