---
license: mit
library_name: diffusers
pipeline_tag: text-to-audio
tags:
- diffusers
- acestep
- audio
- music
- text-to-music
- flow-matching
base_model: ACE-Step/acestep-v15-xl-base
---

# ACE-Step v1.5 XL Base Diffusers

Diffusers-format checkpoint of [ACE-Step v1.5 XL Base](https://huggingface.co/ACE-Step/acestep-v15-xl-base) - the base 5B-parameter flow-matching DiT for text-to-music generation (`hidden_size=2560`, 32 layers, 32 heads; `encoder_hidden_size=2048` on the condition encoder).

This repository is the official Diffusers-format version of the ACE-Step v1.5 XL Base checkpoint. It can be loaded directly with `AceStepPipeline`, which is available in `huggingface/diffusers`.

Weights are produced by `scripts/convert_ace_step_to_diffusers.py` from the upstream release and packaged in the standard Diffusers pipeline layout (`model_index.json` + one subdirectory per module), so the full pipeline can be loaded in a single `from_pretrained` call.

## Usage

Install Diffusers from source until the next package release includes `AceStepPipeline`.

```bash
pip install git+https://github.com/huggingface/diffusers.git
```

```python
import torch
import soundfile as sf
from diffusers import AceStepPipeline

pipe = AceStepPipeline.from_pretrained(
    "ACE-Step/acestep-v15-xl-base-diffusers",
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")

# Long-form audio: enable VAE tiling to keep decode memory bounded.
pipe.vae.enable_tiling()

output = pipe(
    prompt="An upbeat synthwave track with driving drums and a catchy lead",
    lyrics="[Verse]\nNeon lights are calling me\n[Chorus]\nRide the wave tonight",
    audio_duration=30.0,
    num_inference_steps=50,
    guidance_scale=7.0,
    shift=3.0,
    generator=torch.Generator(device="cuda").manual_seed(42),
)

audio = output.audios[0]  # (channels, samples), 48 kHz
sf.write("acestep-xl-base.wav", audio.T.cpu().float().numpy(), pipe.sample_rate)
```

Unlike the turbo checkpoint, XL Base is not guidance-distilled.
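
Because XL Base is not guidance-distilled, `guidance_scale` acts at sampling time, typically by combining a conditional and an unconditional prediction at each step. The sketch below illustrates one projected-guidance update in the general style of APG (adaptive projected guidance). It is not ACE-Step's or Diffusers' implementation: the helper name `apg_style_guidance` and the parameters `eta` and `norm_threshold` are assumptions made for illustration, and it operates on flat Python lists so that it runs without dependencies.

```python
import math


def _dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def apg_style_guidance(cond, uncond, guidance_scale, eta=0.0, norm_threshold=0.0):
    """Illustrative projected-guidance update on flat vectors (hypothetical helper)."""
    # Guidance direction: how the conditional prediction differs from the unconditional one.
    diff = [c - u for c, u in zip(cond, uncond)]

    # Optionally cap the norm of the guidance direction.
    if norm_threshold > 0.0:
        diff_norm = math.sqrt(_dot(diff, diff))
        if diff_norm > norm_threshold:
            diff = [d * norm_threshold / diff_norm for d in diff]

    # Split the direction into parts parallel and orthogonal to the conditional prediction.
    cond_sq = _dot(cond, cond)
    coeff = _dot(diff, cond) / cond_sq if cond_sq > 0.0 else 0.0
    parallel = [coeff * c for c in cond]
    orthogonal = [d - p for d, p in zip(diff, parallel)]

    # eta scales the parallel part; eta=1.0 with no norm cap is plain classifier-free guidance.
    update = [o + eta * p for o, p in zip(orthogonal, parallel)]
    return [c + (guidance_scale - 1.0) * u for c, u in zip(cond, update)]


# Toy 2-D "predictions", assumed values for illustration only.
cond, uncond = [1.0, 0.0], [0.0, 1.0]
cfg = apg_style_guidance(cond, uncond, guidance_scale=7.0, eta=1.0)
projected = apg_style_guidance(cond, uncond, guidance_scale=7.0, eta=0.0)
```

With `eta=1.0` and no norm cap, the update reduces to standard classifier-free guidance; lowering `eta` damps the component of the guidance direction that lies along the conditional prediction.
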
The pipeline uses ACE-Step's APG guidance path when `guidance_scale > 1.0`; `num_inference_steps=50`, `guidance_scale=7.0`, and `shift=3.0` are the recommended defaults. Pass `num_inference_steps=50` explicitly so generation does not use the lower-step turbo setting.

For batched prompts with padding and FlashAttention, use the variable-length backend:

```python
pipe.transformer.set_attention_backend("flash_varlen")
pipe.condition_encoder.set_attention_backend("flash_varlen")
```

For single-prompt generation, the regular `flash` backend is also suitable.

## Repository layout

```
├── model_index.json
├── transformer/               # AceStepTransformer1DModel (DiT, 5B params, bf16)
├── condition_encoder/         # AceStepConditionEncoder (with baked-in silence_latent)
├── audio_tokenizer/           # AceStepAudioTokenizer
├── audio_token_detokenizer/   # AceStepAudioTokenDetokenizer
├── vae/                       # AutoencoderOobleck (48 kHz stereo)
├── text_encoder/              # Qwen3-Embedding-0.6B
├── tokenizer/                 # Qwen3 tokenizer
├── scheduler/                 # FlowMatchEulerDiscreteScheduler config
└── silence_latent.pt          # Raw reference (kept for debugging; not needed at runtime)
```

## License

- ACE-Step weights: MIT (same as [upstream](https://huggingface.co/ACE-Step/acestep-v15-xl-base))
- `text_encoder/` (Qwen3-Embedding-0.6B): Apache 2.0, redistributed per Qwen's license

## Citation

```
@misc{gong2026acestep,
  title = {ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation},
  author = {Junmin Gong and Yulin Song and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo},
  howpublished = {\url{https://github.com/ace-step/ACE-Step-1.5}},
  year = {2026},
  note = {GitHub repository}
}
```