---
base_model:
- HKUSTAudio/AudioX
license: cc-by-nc-4.0
pipeline_tag: text-to-audio
arxiv: 2503.10522
tags:
- audio-generation
- music-generation
---

# AudioX: A Unified Framework for Anything-to-Audio Generation

AudioX is a unified framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals). Its core design is a Multimodal Adaptive Fusion module, which fuses these diverse inputs effectively, strengthening cross-modal alignment and improving overall generation quality.

- **Paper:** [AudioX: A Unified Framework for Anything-to-Audio Generation](https://huggingface.co/papers/2503.10522)
- **Project Page:** [https://zeyuet.github.io/AudioX/](https://zeyuet.github.io/AudioX/)
- **Repository:** [https://github.com/ZeyueT/AudioX](https://github.com/ZeyueT/AudioX)
- **Demo:** [Hugging Face Space](https://huggingface.co/spaces/Zeyue7/AudioX)

## Sample Usage

The following script shows how to run the model programmatically. Note that you first need to install the `audiox` package as described in the [official repository](https://github.com/ZeyueT/AudioX).
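If the package is not already installed, a minimal setup sketch is shown below. The exact commands are an assumption based on a standard pip-installable layout; defer to the repository's installation instructions if they differ.

```bash
# Assumed setup steps -- follow the AudioX repository README if it specifies otherwise
git clone https://github.com/ZeyueT/AudioX.git
cd AudioX
pip install -e .
```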
```python
import torch
import torchaudio
from einops import rearrange
from audiox import get_pretrained_model
from audiox.inference.generation import generate_diffusion_cond
from audiox.data.utils import read_video, merge_video_audio, load_and_process_audio, encode_video_with_synchformer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pretrained model
# Choose one: "HKUSTAudio/AudioX", "HKUSTAudio/AudioX-MAF", or "HKUSTAudio/AudioX-MAF-MMDiT"
model_name = "HKUSTAudio/AudioX"
model, model_config = get_pretrained_model(model_name)
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]
target_fps = model_config["video_fps"]
seconds_start = 0
seconds_total = 10

model = model.to(device)

# Example: video-to-music generation
video_path = "example/V2M_sample-1.mp4"
text_prompt = "Generate music for the video"
audio_path = None

# Prepare inputs
video_tensor = read_video(video_path, seek_time=seconds_start, duration=seconds_total, target_fps=target_fps)
if audio_path:
    audio_tensor = load_and_process_audio(audio_path, sample_rate, seconds_start, seconds_total)
else:
    # Use a silent stereo tensor when no audio prompt is provided
    audio_tensor = torch.zeros((2, int(sample_rate * seconds_total)))

# For AudioX-MAF and AudioX-MAF-MMDiT: encode video frames with Synchformer
video_sync_frames = None
if "MAF" in model_name:
    video_sync_frames = encode_video_with_synchformer(
        video_path, model_name, seconds_start, seconds_total, device
    )

# Build the conditioning dictionary
conditioning = [{
    "video_prompt": {"video_tensors": video_tensor.unsqueeze(0), "video_sync_frames": video_sync_frames},
    "text_prompt": text_prompt,
    "audio_prompt": audio_tensor.unsqueeze(0),
    "seconds_start": seconds_start,
    "seconds_total": seconds_total
}]

# Generate audio with the diffusion sampler
output = generate_diffusion_cond(
    model,
    steps=250,
    cfg_scale=7,
    conditioning=conditioning,
    sample_size=sample_size,
    sigma_min=0.3,
    sigma_max=500,
    sampler_type="dpmpp-3m-sde",
    device=device
)

# Post-process: flatten the batch, peak-normalize, convert to 16-bit PCM, and save
output = rearrange(output, "b d n -> d (b n)")
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)
```
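The imports above also expose `merge_video_audio`, which can presumably mux the generated track back onto the input video for preview. The snippet below continues the script above and is only a sketch: the argument order is an assumption, so verify it against `audiox.data.utils.merge_video_audio` before relying on it.

```python
# Sketch (continuing the script above): combine the source video with the generated audio.
# Assumed argument order: (video path, audio path, output path, start time, duration) --
# check audiox.data.utils.merge_video_audio for the actual signature.
merge_video_audio(video_path, "output.wav", "output_with_music.mp4", seconds_start, seconds_total)
```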
## Citation

```bibtex
@article{tian2025audiox,
  title={AudioX: Diffusion Transformer for Anything-to-Audio Generation},
  author={Tian, Zeyue and Jin, Yizhu and Liu, Zhaoyang and Yuan, Ruibin and Tan, Xu and Chen, Qifeng and Xue, Wei and Guo, Yike},
  journal={arXiv preprint arXiv:2503.10522},
  year={2025}
}
```