---
base_model:
- HKUSTAudio/AudioX-Turbo
license: cc-by-nc-4.0
pipeline_tag: text-to-audio
library_name: stable-audio-tools
tags:
- audio-generation
- music-generation
- text-to-audio
- video-to-audio
- diffusion_cond
- distillation
arxiv: 2606.12555
---

# AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation

**AudioX-Turbo** is a unified and efficient framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals). It follows a *teacher–student* paradigm: the teacher **AudioX-Base** is built on a Multimodal Diffusion Transformer with a Multimodal Adaptive Fusion (MAF) module that aligns diverse multimodal inputs for high-fidelity synthesis, and is then distilled into the few-step student **AudioX-Turbo** via Distribution Matching Distillation (DMD) adapted to flow matching, complemented by a diffusion-based discriminator for high-quality few-step generation.

AudioX-Turbo generates audio in only **4 sampling steps** (no classifier-free guidance), requiring up to **~25×** fewer function evaluations (NFE) than multi-step baselines while achieving superior performance, especially on text-to-audio and text-to-music generation.

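To make the NFE comparison concrete, here is a minimal arithmetic sketch. It assumes that each sampling step costs one network evaluation and that classifier-free guidance doubles that cost (one conditional and one unconditional pass). The 50-step guided baseline is a hypothetical configuration chosen for illustration; it is not taken from this card, and `count_nfe` is an illustrative helper, not part of any library.

```python
def count_nfe(steps: int, uses_cfg: bool) -> int:
    """Network evaluations per sample: one per step, doubled when
    classifier-free guidance adds an unconditional pass (assumption)."""
    return steps * (2 if uses_cfg else 1)


# Setting stated in this card: 4 steps, no classifier-free guidance.
turbo_nfe = count_nfe(steps=4, uses_cfg=False)

# Hypothetical multi-step baseline (assumed, not from the card).
baseline_nfe = count_nfe(steps=50, uses_cfg=True)

reduction = baseline_nfe / turbo_nfe
print(f"{turbo_nfe} vs {baseline_nfe} NFE: {reduction:.0f}x fewer")
```

Under these assumptions the ratio comes out at 25; baselines with other step counts or without guidance give a different factor, which is consistent with the "up to" wording above.
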
- **Paper:** [AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation](https://arxiv.org/abs/2606.12555)
- **Project Page:** https://zeyuet.github.io/AudioX-Turbo/
- **Repository:** https://github.com/NoizAI/AudioX-Turbo

## Files

| File | Description |
|:-----|:------------|
| `audiox_turbo/audiox_turbo.ckpt` | AudioX-Turbo: distilled 4-step student model (inference) |
| `pretransform/vae.ckpt` | VAE pretransform |
| `synchformer/synchformer_state_dict.pth` | Synchformer, for video-conditioned (V2A/V2M) generation |
| `pretrained_ckpt/pretrained_ckpt.ckpt` | Teacher / base model (training only: student init + teacher) |

## Download

```bash
# Inference checkpoints (student + VAE + Synchformer)
huggingface-cli download HKUSTAudio/AudioX-Turbo \
  audiox_turbo/audiox_turbo.ckpt pretransform/vae.ckpt synchformer/synchformer_state_dict.pth \
  --local-dir checkpoints

# Training only: teacher / base model
huggingface-cli download HKUSTAudio/AudioX-Turbo \
  pretrained_ckpt/pretrained_ckpt.ckpt \
  --local-dir checkpoints
```

## Sample Usage

To use this model programmatically, install the `audiox_turbo` package as specified in the [official repository](https://github.com/NoizAI/AudioX-Turbo).

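Before running the sample, it can help to confirm that the inference checkpoints from the Download step are where the code expects them. The sketch below assumes the `checkpoints` directory used in the download command; `missing_checkpoints` is a hypothetical helper written for this card, not a function of the `audiox_turbo` package.

```python
from pathlib import Path

# Relative paths taken from the Files table and the download command.
INFERENCE_FILES = (
    "audiox_turbo/audiox_turbo.ckpt",
    "pretransform/vae.ckpt",
    "synchformer/synchformer_state_dict.pth",
)


def missing_checkpoints(root="checkpoints", files=INFERENCE_FILES):
    """Return the expected checkpoint files that are not present under `root`.

    Hypothetical helper for illustration only.
    """
    root = Path(root)
    return [name for name in files if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:", ", ".join(missing))
    else:
        print("All inference checkpoints found.")
```

With the files verified, the full generation example follows:
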
```python
import torch
import torchaudio
from einops import rearrange

from audiox_turbo.inference import load_audiox_turbo_model
from audiox_turbo.inference.generation import generate_diffusion_cond_dmd
from audiox_turbo.data.utils import (
    read_video,
    load_and_process_audio,
    encode_video_with_synchformer,
    merge_video_audio,
)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the distilled 4-step student
model, model_config = load_audiox_turbo_model(
    "configs/audiox_turbo_infer_4step.json",
    "checkpoints/audiox_turbo/audiox_turbo.ckpt",
    pretransform_ckpt_path="checkpoints/pretransform/vae.ckpt",
    device=device,
)
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]
target_fps = model_config.get("video_fps", 5)
seconds_total = 10

# --- Choose a task by setting the inputs below ---
# Text-to-Audio:  video_path=None, text_prompt="Typing on a keyboard"
# Video-to-Music: video_path="example/V2M_sample-1.mp4", text_prompt="Generate music for the video"
video_path = "example/V2M_sample-1.mp4"
text_prompt = "Generate music for the video"
audio_path = None

if video_path:
    video_tensor = read_video(video_path, seek_time=0, duration=seconds_total, target_fps=target_fps)
    sync_features = encode_video_with_synchformer(video_path, 0, seconds_total, device=device)
else:
    video_tensor = torch.zeros(seconds_total * target_fps, 3, 224, 224)
    sync_features = torch.zeros(1, 240, 768, device=device)
audio_tensor = load_and_process_audio(audio_path, sample_rate, 0, seconds_total)

conditioning = [{
    "video_prompt": {"video_tensors": video_tensor.unsqueeze(0), "video_sync_frames": sync_features},
    "text_prompt": text_prompt or "",
    "audio_prompt": audio_tensor.unsqueeze(0),
    "seconds_start": 0,
    "seconds_total": seconds_total,
}]

# 4-step generation (no classifier-free guidance)
output = generate_diffusion_cond_dmd(
    model,
    steps=4,
    conditioning=conditioning,
    sample_size=sample_size,
    seed=0,
    device=device,
)

output = output[:, :, : sample_rate * seconds_total]
output = rearrange(output, "b d n -> d (b n)")
output = output.to(torch.float32).div(torch.max(torch.abs(output)).clamp_min(1e-8)).clamp(-1, 1)
torchaudio.save("output.wav", output.cpu(), sample_rate)

# Optional: mux the audio back onto the source video
if video_path:
    merge_video_audio(video_path, "output.wav", "output.mp4", 0, seconds_total)
```

## Supported Tasks

AudioX-Turbo is a unified model that accepts text, video, and audio conditions in any combination:

| Task                 | `video_path`       | `text_prompt`                                 | `audio_path` |
|:---------------------|:-------------------|:----------------------------------------------|:-------------|
| Text-to-Audio (T2A)  | `None`             | `"Typing on a keyboard"`                      | `None`       |
| Text-to-Music (T2M)  | `None`             | `"A music with piano and violin"`             | `None`       |
| Video-to-Audio (V2A) | `"video_path.mp4"` | `"Generate general audio for the video"`      | `None`       |
| Video-to-Music (V2M) | `"video_path.mp4"` | `"Generate music for the video"`              | `None`       |
| TV-to-Audio (TV2A)   | `"video_path.mp4"` | `"Ocean waves crashing with people laughing"` | `None`       |
| TV-to-Music (TV2M)   | `"video_path.mp4"` | `"Generate music with piano instrument"`      | `None`       |

## Citation

If you find our work useful, please consider citing:

```bibtex
@article{tian2026audioxturbo,
  title={AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation},
  author={Tian, Zeyue and Ke, Lei and Liu, Zhaoyang and Yuan, Ruibin and Xue, Liumeng and Yang, Yujiu and Chen, Weijia and Tan, Xu and Chen, Qifeng and Xue, Wei and Guo, Yike},
  journal={arXiv preprint arXiv:2606.12555},
  year={2026}
}

@inproceedings{tian2026audiox,
  title={AudioX: a unified framework for anything-to-audio generation},
  author={Tian, Zeyue and Jin, Y and Liu, Z and others},
  booktitle={Proceedings of the Fourteenth International Conference on Learning Representations},
  year={2026}
}
```

## License

This model is released under [CC-BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/).

**Note:** The models are watermarked and are strictly for non-commercial use only.