---
license: mit
library_name: diffusers
pipeline_tag: text-to-audio
tags:
- diffusers
- acestep
- audio
- music
- text-to-music
- flow-matching
base_model: ACE-Step/acestep-v15-xl-base
---
# ACE-Step v1.5 XL Base Diffusers
Diffusers-format checkpoint of [ACE-Step v1.5 XL Base](https://huggingface.co/ACE-Step/acestep-v15-xl-base) - the base 5B-parameter flow-matching DiT for text-to-music generation (`hidden_size=2560`, 32 layers, 32 heads; `encoder_hidden_size=2048` on the condition encoder).
This is the official Diffusers-format release of the checkpoint. It loads directly with `AceStepPipeline`, which is available in `huggingface/diffusers`.
Weights are produced by `scripts/convert_ace_step_to_diffusers.py` from the upstream release and packaged in the standard Diffusers pipeline layout (`model_index.json` + one subdirectory per module), so the full pipeline can be loaded in a single `from_pretrained` call.
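As a rough guide to hardware requirements, weight memory can be estimated from the parameter count and the dtype. The sketch below uses the nominal 5B figure quoted above and counts only the DiT weights; activations, the VAE, the text encoder, and the condition encoder come on top. `weight_memory_gib` is an illustrative helper, not part of Diffusers.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gib(num_params: int, dtype: str = "bfloat16") -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


NOMINAL_DIT_PARAMS = 5_000_000_000  # "5B" as stated in this card; the exact count may differ

bf16_gib = weight_memory_gib(NOMINAL_DIT_PARAMS, "bfloat16")
fp32_gib = weight_memory_gib(NOMINAL_DIT_PARAMS, "float32")
print(f"bf16: {bf16_gib:.1f} GiB, fp32: {fp32_gib:.1f} GiB")
```

Loading in `torch.bfloat16`, as in the usage example, halves the weight footprint compared with float32.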
## Usage
Install Diffusers from source until the next package release includes `AceStepPipeline`.
```bash
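pip install git+https://github.com/huggingface/diffusers.git
# The example below also needs transformers (text encoder) and soundfile (WAV output).
pip install transformers soundfile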
```
```python
import torch
import soundfile as sf
from diffusers import AceStepPipeline

pipe = AceStepPipeline.from_pretrained(
    "ACE-Step/acestep-v15-xl-base-diffusers",
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")

# Long-form audio: enable VAE tiling to keep decode memory bounded.
pipe.vae.enable_tiling()

output = pipe(
    prompt="An upbeat synthwave track with driving drums and a catchy lead",
    lyrics="[Verse]\nNeon lights are calling me\n[Chorus]\nRide the wave tonight",
    audio_duration=30.0,
    num_inference_steps=50,
    guidance_scale=7.0,
    shift=3.0,
    generator=torch.Generator(device="cuda").manual_seed(42),
)

audio = output.audios[0]  # (channels, samples), 48 kHz
sf.write("acestep-xl-base.wav", audio.T.cpu().float().numpy(), pipe.sample_rate)
```
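The example above transposes the `(channels, samples)` output because WAV writers expect one frame per row, with the channel samples for each instant stored together. The standard-library sketch below illustrates that layout change without any audio dependency. `write_wav_16bit` is a hypothetical helper, and a synthetic sine tone stands in for model output.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 48_000  # matches the 48 kHz output noted above


def write_wav_16bit(path, channels, sample_rate):
    """Write channels-first float audio in [-1, 1] as 16-bit PCM; return the frame count."""
    frames = bytearray()
    # zip(*channels) is the transpose: it yields one (left, right, ...) frame per sample.
    for frame in zip(*channels):
        for sample in frame:
            clipped = max(-1.0, min(1.0, sample))
            frames += struct.pack("<h", round(clipped * 32767))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(len(channels))
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return len(frames) // (2 * len(channels))


# Synthetic stand-in for model output: 10 ms of a 440 Hz stereo tone.
num_samples = SAMPLE_RATE // 100
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(num_samples)]
stereo = [tone, tone]  # (channels, samples)

wav_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
frames_written = write_wav_16bit(wav_path, stereo, SAMPLE_RATE)
```

At 48 kHz, a 30-second clip such as the one requested above corresponds to 1,440,000 frames per channel.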
Unlike the turbo checkpoint, XL Base is not guidance-distilled. The pipeline takes ACE-Step's APG (adaptive projected guidance) path when `guidance_scale > 1.0`. The recommended defaults are `num_inference_steps=50`, `guidance_scale=7.0`, and `shift=3.0`. Pass `num_inference_steps=50` explicitly so that generation does not use the lower-step turbo setting.
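To give a feel for what `shift` and `guidance_scale` control, here is a dependency-free sketch of the two ideas. It assumes the static timestep shift used by `FlowMatchEulerDiscreteScheduler` and the basic projection step of adaptive projected guidance; the uniform time grid, the omission of APG's momentum term, and the `eta` and `norm_threshold` defaults are simplifications chosen for illustration, so the pipeline's actual implementation and hyperparameters may differ.

```python
import math


def shifted_sigmas(num_steps, shift):
    """Noise levels from 1.0 downward, warped by a static timestep shift."""
    # Assumption: a uniform grid over (0, 1]; the real scheduler may use other endpoints.
    linear = [1.0 - i / num_steps for i in range(num_steps)]
    # shift > 1 keeps more of the steps at high noise levels.
    return [shift * s / (1.0 + (shift - 1.0) * s) for s in linear]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def apg_guidance(cond, uncond, guidance_scale, eta=0.0, norm_threshold=0.0):
    """Simplified APG step on flat vectors (no momentum)."""
    diff = [c - u for c, u in zip(cond, uncond)]
    # Optionally cap the norm of the guidance direction.
    if norm_threshold > 0.0:
        norm = math.sqrt(_dot(diff, diff))
        if norm > norm_threshold:
            diff = [d * norm_threshold / norm for d in diff]
    # Split the direction into parts parallel and orthogonal to the conditional prediction.
    cond_sq = _dot(cond, cond)
    coef = _dot(diff, cond) / cond_sq if cond_sq > 0.0 else 0.0
    parallel = [coef * c for c in cond]
    orthogonal = [d - p for d, p in zip(diff, parallel)]
    # eta scales the parallel part; eta=1 with no threshold recovers plain CFG.
    update = [o + eta * p for o, p in zip(orthogonal, parallel)]
    return [c + (guidance_scale - 1.0) * u for c, u in zip(cond, update)]


sigmas = shifted_sigmas(num_steps=50, shift=3.0)
guided = apg_guidance(cond=[1.0, 0.0], uncond=[0.0, 1.0], guidance_scale=7.0)
```

With `shift=3.0`, the midpoint of the uniform grid (0.5) maps to a noise level of 0.75, so more of the 50 steps are spent on the noisier part of the trajectory.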
For batched prompts with padding and FlashAttention, use the variable-length backend:
```python
pipe.transformer.set_attention_backend("flash_varlen")
pipe.condition_encoder.set_attention_backend("flash_varlen")
```
For single-prompt generation, the regular `flash` backend is also suitable.
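The reason padding matters: variable-length FlashAttention kernels work on sequences packed end to end and indexed by cumulative sequence lengths, so pad tokens are never processed. The sketch below shows that bookkeeping for a hypothetical batch of prompt lengths. It is conceptual only; the backend is expected to handle this internally, and the helper names are not Diffusers APIs.

```python
from itertools import accumulate


def cu_seqlens(lengths):
    """Cumulative sequence boundaries for a packed batch, starting at 0."""
    return [0, *accumulate(lengths)]


def padding_waste(lengths):
    """Fraction of a padded batch that consists of pad tokens."""
    padded_tokens = len(lengths) * max(lengths)
    real_tokens = sum(lengths)
    return 1.0 - real_tokens / padded_tokens


prompt_lengths = [5, 3, 7]  # hypothetical token counts for three prompts
boundaries = cu_seqlens(prompt_lengths)
waste = padding_waste(prompt_lengths)

# Sequence i occupies packed positions boundaries[i]:boundaries[i + 1].
print(boundaries, f"{waste:.0%} of the padded batch is padding")
```

With a single prompt there is nothing to pad, which is consistent with the regular `flash` backend being sufficient in that case.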
## Repository layout
```
├── model_index.json
├── transformer/               # AceStepTransformer1DModel (DiT, 5B params, bf16)
├── condition_encoder/         # AceStepConditionEncoder (with baked-in silence_latent)
├── audio_tokenizer/           # AceStepAudioTokenizer
├── audio_token_detokenizer/   # AceStepAudioTokenDetokenizer
├── vae/                       # AutoencoderOobleck (48 kHz stereo)
├── text_encoder/              # Qwen3-Embedding-0.6B
├── tokenizer/                 # Qwen3 tokenizer
├── scheduler/                 # FlowMatchEulerDiscreteScheduler config
└── silence_latent.pt          # Raw reference (kept for debugging; not needed at runtime)
```
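In the standard Diffusers layout, `model_index.json` ties these subdirectories together: each non-underscore key names a subdirectory and maps to a `[library, class]` pair. The sketch below parses a hand-written excerpt to show the shape. The excerpt is illustrative rather than a copy of the real file: it omits `text_encoder/` and `tokenizer/`, and the `"diffusers"` library entries are assumptions based on the class names listed above.

```python
import json

# Hand-written excerpt for illustration; not the actual file from this repository.
MODEL_INDEX_EXCERPT = """
{
  "_class_name": "AceStepPipeline",
  "transformer": ["diffusers", "AceStepTransformer1DModel"],
  "condition_encoder": ["diffusers", "AceStepConditionEncoder"],
  "vae": ["diffusers", "AutoencoderOobleck"],
  "scheduler": ["diffusers", "FlowMatchEulerDiscreteScheduler"]
}
"""


def component_table(model_index_text):
    """Map each component subdirectory to its (library, class) pair."""
    index = json.loads(model_index_text)
    # Keys starting with "_" are pipeline metadata, not components.
    return {
        name: tuple(spec)
        for name, spec in index.items()
        if not name.startswith("_")
    }


components = component_table(MODEL_INDEX_EXCERPT)
for name, (library, class_name) in components.items():
    print(f"{name}/ -> {library}.{class_name}")
```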
## License
- ACE-Step weights: MIT (same as [upstream](https://huggingface.co/ACE-Step/acestep-v15-xl-base))
- `text_encoder/` (Qwen3-Embedding-0.6B): Apache 2.0 - redistributed per Qwen's license
## Citation
```
@misc{gong2026acestep,
title = {ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation},
author = {Junmin Gong and Yulin Song and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo},
howpublished = {\url{https://github.com/ace-step/ACE-Step-1.5}},
year = {2026},
note = {GitHub repository}
}
```