---
license: mit
library_name: diffusers
pipeline_tag: text-to-audio
tags:
- diffusers
- acestep
- audio
- music
- text-to-music
- flow-matching
base_model: ACE-Step/acestep-v15-xl-base
---

# ACE-Step v1.5 XL Base Diffusers

Diffusers-format checkpoint of [ACE-Step v1.5 XL Base](https://huggingface.co/ACE-Step/acestep-v15-xl-base), the base 5B-parameter flow-matching DiT for text-to-music generation (`hidden_size=2560`, 32 layers, 32 heads; `encoder_hidden_size=2048` on the condition encoder).

This is the official Diffusers-format release of the checkpoint. It loads directly with `AceStepPipeline`, which is available in `huggingface/diffusers`.

The weights are converted from the upstream release by `scripts/convert_ace_step_to_diffusers.py` and packaged in the standard Diffusers pipeline layout (`model_index.json` plus one subdirectory per module), so the full pipeline loads with a single `from_pretrained` call.

## Usage

Install Diffusers from source until the next package release includes `AceStepPipeline`. The example below also imports `soundfile` to write the WAV file.

```bash
pip install git+https://github.com/huggingface/diffusers.git
pip install soundfile
```

```python
import torch
import soundfile as sf
from diffusers import AceStepPipeline

# Load every pipeline module from the repository in bfloat16.
pipe = AceStepPipeline.from_pretrained(
    "ACE-Step/acestep-v15-xl-base-diffusers",
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")

# Long-form audio: enable VAE tiling to keep decode memory bounded.
pipe.vae.enable_tiling()

output = pipe(
    prompt="An upbeat synthwave track with driving drums and a catchy lead",
    lyrics="[Verse]\nNeon lights are calling me\n[Chorus]\nRide the wave tonight",
    audio_duration=30.0,
    num_inference_steps=50,
    guidance_scale=7.0,
    shift=3.0,
    generator=torch.Generator(device="cuda").manual_seed(42),
)

audio = output.audios[0]  # (channels, samples), 48 kHz
# soundfile expects (samples, channels), so transpose; .float() guards against
# bfloat16, which NumPy cannot represent.
sf.write("acestep-xl-base.wav", audio.T.cpu().float().numpy(), pipe.sample_rate)
```

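The final lines of the example depend on the `(channels, samples)` layout: WAV writers want interleaved `(samples, channels)` data, which is why the array is transposed. As an illustration of that step, here is a minimal sketch that writes 16-bit PCM with the standard-library `wave` module instead of `soundfile`. The helper name `write_wav_pcm16` is hypothetical, and the sketch assumes the audio is a float array in the range [-1, 1], which the model card does not state. For reference, 30 seconds at 48 kHz is nominally 30 × 48000 = 1,440,000 samples per channel.

```python
import wave

import numpy as np


def write_wav_pcm16(path, audio, sample_rate):
    """Write a (channels, samples) float array as a 16-bit PCM WAV file.

    Hypothetical helper for illustration; assumes values lie in [-1, 1].
    """
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 1:
        audio = audio[None, :]  # treat a 1-D array as mono
    num_channels = audio.shape[0]

    # Transpose to (samples, channels) so tobytes() emits interleaved frames.
    interleaved = np.clip(audio.T, -1.0, 1.0)
    pcm = (interleaved * 32767.0).astype("<i2")

    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(num_channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav_file.setframerate(int(sample_rate))
        wav_file.writeframes(pcm.tobytes())
```

With the pipeline output from the example, a call would look like `write_wav_pcm16("out.wav", audio.cpu().float().numpy(), pipe.sample_rate)`, passing the array untransposed because the helper handles the layout itself.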
Unlike the turbo checkpoint, XL Base is not guidance-distilled. The pipeline uses ACE-Step's APG guidance path when `guidance_scale > 1.0`; `num_inference_steps=50`, `guidance_scale=7.0`, and `shift=3.0` are the recommended defaults. Pass `num_inference_steps=50` explicitly so generation does not use the lower-step turbo setting.

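To give a sense of what APG (adaptive projected guidance) does with `guidance_scale`, here is a simplified NumPy sketch of the general technique. It is not ACE-Step's implementation: the function names and the `eta` parameter are illustrative assumptions, and the momentum and norm-rescaling terms of the full method are left out. The idea is to split the conditional/unconditional difference into components parallel and orthogonal to the conditional prediction, then down-weight the parallel part.

```python
import numpy as np


def project(v, onto):
    """Split v into components parallel and orthogonal to `onto`."""
    unit = onto / max(np.linalg.norm(onto), 1e-12)
    parallel = np.sum(v * unit) * unit
    return parallel, v - parallel


def apg_guidance(pred_cond, pred_uncond, guidance_scale, eta=0.0):
    """Simplified APG update (illustrative only, no momentum or rescaling).

    eta=1.0 recovers plain classifier-free guidance;
    eta=0.0 keeps only the component orthogonal to pred_cond.
    """
    diff = pred_cond - pred_uncond
    parallel, orthogonal = project(diff, pred_cond)
    update = orthogonal + eta * parallel
    return pred_cond + (guidance_scale - 1.0) * update
```

In this sketch, `guidance_scale=1.0` returns the conditional prediction unchanged, which matches the model card's note that the guidance path only applies when `guidance_scale > 1.0`.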
For batched prompts with padding and FlashAttention, use the variable-length backend:

```python
pipe.transformer.set_attention_backend("flash_varlen")
pipe.condition_encoder.set_attention_backend("flash_varlen")
```

For single-prompt generation, the regular `flash` backend is also suitable.

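The backend choice above can be wrapped in a small helper so both modules always get the same setting. This is a sketch only: `set_flash_backend` is a hypothetical name, and it assumes that any batch of more than one prompt is padded.

```python
def set_flash_backend(modules, num_prompts):
    """Apply one FlashAttention backend to every module and return its name.

    Hypothetical helper: uses the variable-length backend for batches
    (assumed to be padded) and the regular backend for a single prompt.
    """
    backend = "flash_varlen" if num_prompts > 1 else "flash"
    for module in modules:
        module.set_attention_backend(backend)
    return backend
```

With the pipeline from the usage example, this would be called as `set_flash_backend([pipe.transformer, pipe.condition_encoder], num_prompts=len(prompts))`, where `prompts` is your own list of prompt strings.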
## Repository layout

```
├── model_index.json
├── transformer/               # AceStepTransformer1DModel (DiT, 5B params, bf16)
├── condition_encoder/         # AceStepConditionEncoder (with baked-in silence_latent)
├── audio_tokenizer/           # AceStepAudioTokenizer
├── audio_token_detokenizer/   # AceStepAudioTokenDetokenizer
├── vae/                       # AutoencoderOobleck (48 kHz stereo)
├── text_encoder/              # Qwen3-Embedding-0.6B
├── tokenizer/                 # Qwen3 tokenizer
├── scheduler/                 # FlowMatchEulerDiscreteScheduler config
└── silence_latent.pt          # Raw reference (kept for debugging; not needed at runtime)
```

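The layout above is tied together by `model_index.json`. As a sketch of how to inspect it without loading any weights, the helper below lists the declared components. It assumes the usual Diffusers convention that each key not starting with an underscore maps to a `[library, class_name]` pair; `list_components` is a hypothetical name, and the exact contents of this repository's file are not reproduced here.

```python
import json
from pathlib import Path


def list_components(model_index_path):
    """Map each component in model_index.json to its (library, class) pair.

    Hypothetical helper; skips metadata keys such as "_class_name".
    """
    index = json.loads(Path(model_index_path).read_text(encoding="utf-8"))
    components = {}
    for name, value in index.items():
        if name.startswith("_"):
            continue  # metadata, not a loadable module
        if isinstance(value, list) and len(value) == 2:
            components[name] = tuple(value)
    return components
```

Run against a local snapshot of this repository, the result should contain one entry per module subdirectory shown in the tree, if the assumed convention holds.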
## License

- ACE-Step weights: MIT (same as [upstream](https://huggingface.co/ACE-Step/acestep-v15-xl-base))
- `text_encoder/` (Qwen3-Embedding-0.6B): Apache 2.0, redistributed per Qwen's license

## Citation

```
@misc{gong2026acestep,
  title = {ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation},
  author = {Junmin Gong and Yulin Song and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo},
  howpublished = {\url{https://github.com/ace-step/ACE-Step-1.5}},
  year = {2026},
  note = {GitHub repository}
}
```