	---
	license: mit
	library_name: diffusers
	pipeline_tag: text-to-audio
	tags:
	- diffusers
	- acestep
	- audio
	- music
	- text-to-music
	- flow-matching
	base_model: ACE-Step/acestep-v15-xl-sft
	---

	# ACE-Step v1.5 XL SFT Diffusers

	Diffusers-format checkpoint of [ACE-Step v1.5 XL SFT](https://huggingface.co/ACE-Step/acestep-v15-xl-sft) - the supervised fine-tuned 5B-parameter flow-matching DiT for text-to-music generation (`hidden_size=2560`, 32 layers, 32 heads; `encoder_hidden_size=2048` on the condition encoder).

	This repository is the official Diffusers-format release of that checkpoint. It loads directly with `AceStepPipeline` from `huggingface/diffusers`.

	Weights are produced by `scripts/convert_ace_step_to_diffusers.py` from the upstream release and packaged in the standard Diffusers pipeline layout (`model_index.json` + one subdirectory per module), so the full pipeline can be loaded in a single `from_pretrained` call.
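
To illustrate the layout described above, here is a minimal sketch that lists the component subdirectories named in a `model_index.json`. It assumes the usual Diffusers convention that keys starting with `_` are metadata and every other key maps a component name to a `[library, class]` pair. The `list_components` helper and the sample dictionary are hypothetical and abbreviated; the file shipped in this repository is the authoritative source.

```python
import json


def list_components(model_index: dict) -> dict:
    """Map each component name to its class name, skipping metadata keys."""
    components = {}
    for name, spec in model_index.items():
        if name.startswith("_"):
            continue  # metadata such as _class_name or _diffusers_version
        # Component entries are assumed to be [library, class_name] pairs.
        if isinstance(spec, (list, tuple)) and len(spec) == 2:
            components[name] = spec[1]
    return components


# Hypothetical, abbreviated example; not a copy of this repository's file.
sample_index = json.loads("""
{
  "_class_name": "AceStepPipeline",
  "transformer": ["diffusers", "AceStepTransformer1DModel"],
  "vae": ["diffusers", "AutoencoderOobleck"],
  "scheduler": ["diffusers", "FlowMatchEulerDiscreteScheduler"]
}
""")

components = list_components(sample_index)
for name, cls in components.items():
    print(f"{name}/ -> {cls}")
```

Each key returned by the helper corresponds to one subdirectory that `from_pretrained` would be expected to resolve.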

	## Usage

	Install Diffusers from source until the next package release includes `AceStepPipeline`.

	```bash
	pip install git+https://github.com/huggingface/diffusers.git
	```

	```python
	import torch
	import soundfile as sf
	from diffusers import AceStepPipeline

	pipe = AceStepPipeline.from_pretrained(
	    "ACE-Step/acestep-v15-xl-sft-diffusers",
	    torch_dtype=torch.bfloat16,
	)
	pipe = pipe.to("cuda")

	# Long-form audio: enable VAE tiling to keep decode memory bounded.
	pipe.vae.enable_tiling()

	output = pipe(
	    prompt="An upbeat synthwave track with driving drums and a catchy lead",
	    lyrics="[Verse]\nNeon lights are calling me\n[Chorus]\nRide the wave tonight",
	    audio_duration=30.0,
	    num_inference_steps=50,
	    guidance_scale=7.0,
	    shift=3.0,
	    generator=torch.Generator(device="cuda").manual_seed(42),
	)

	audio = output.audios[0]  # (channels, samples), 48 kHz
	sf.write("acestep-xl-sft.wav", audio.T.cpu().float().numpy(), pipe.sample_rate)
	```

	Unlike the turbo checkpoint, XL SFT is not guidance-distilled. The pipeline uses ACE-Step's APG guidance path when `guidance_scale > 1.0`. The recommended settings are `num_inference_steps=50`, `guidance_scale=7.0`, and `shift=3.0`. Pass `num_inference_steps=50` explicitly so that generation does not use the lower-step turbo setting.
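
For intuition about what `shift` does, the sketch below applies the time-shift formula used by Diffusers' `FlowMatchEulerDiscreteScheduler`, `sigma' = shift * sigma / (1 + (shift - 1) * sigma)`, to a simplified uniform 50-step schedule. It assumes the pipeline applies `shift` in this standard way, which this card does not confirm, and the real scheduler's sigma grid differs in detail. `shifted_sigmas` is a hypothetical helper, not part of the library.

```python
def shifted_sigmas(num_steps: int, shift: float) -> list[float]:
    """Uniform sigmas from 1.0 down toward 0, warped by the shift factor."""
    # Simplified uniform grid: 1.0, 1 - 1/N, ..., 1/N (an assumption).
    sigmas = [1.0 - i / num_steps for i in range(num_steps)]
    # Standard flow-matching time shift; shift=1.0 leaves sigmas unchanged.
    return [shift * s / (1.0 + (shift - 1.0) * s) for s in sigmas]


plain = shifted_sigmas(50, shift=1.0)
warped = shifted_sigmas(50, shift=3.0)

# Compare the unshifted and shifted noise level at a few steps.
for step in (0, 10, 25, 40, 49):
    print(f"step {step:2d}: sigma={plain[step]:.3f} -> shifted={warped[step]:.3f}")
```

Under this formula, `shift > 1` keeps the noise level higher for longer, so more of the 50 steps are spent in the high-noise part of the trajectory.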

	For batched prompts with padding and FlashAttention, use the variable-length backend:

	```python
	pipe.transformer.set_attention_backend("flash_varlen")
	pipe.condition_encoder.set_attention_backend("flash_varlen")
	```

	For single-prompt generation, the regular `flash` backend is also suitable.
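
The choice between the two backends can be sketched as a small helper. `pick_attention_backend` is a hypothetical name introduced here for illustration; it simply encodes the rule stated above (variable-length backend for batches, regular `flash` for a single prompt) and assumes FlashAttention is installed.

```python
from typing import Sequence


def pick_attention_backend(prompts: Sequence[str]) -> str:
    """Return "flash_varlen" for batched prompts, otherwise "flash"."""
    if not prompts:
        raise ValueError("at least one prompt is required")
    # Batches are padded to a common length, so use the varlen backend.
    return "flash_varlen" if len(prompts) > 1 else "flash"


backend = pick_attention_backend(["lo-fi piano", "fast drum and bass"])
print(backend)

# With a loaded pipeline, the result would be applied as shown earlier:
# pipe.transformer.set_attention_backend(backend)
# pipe.condition_encoder.set_attention_backend(backend)
```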

	## Repository layout

	```
	├── model_index.json
	├── transformer/ # AceStepTransformer1DModel (DiT, 5B params, bf16)
	├── condition_encoder/ # AceStepConditionEncoder (with baked-in silence_latent)
	├── audio_tokenizer/ # AceStepAudioTokenizer
	├── audio_token_detokenizer/ # AceStepAudioTokenDetokenizer
	├── vae/ # AutoencoderOobleck (48 kHz stereo)
	├── text_encoder/ # Qwen3-Embedding-0.6B
	├── tokenizer/ # Qwen3 tokenizer
	├── scheduler/ # FlowMatchEulerDiscreteScheduler config
	└── silence_latent.pt # Raw reference (kept for debugging; not needed at runtime)
	```

	## License

	- ACE-Step weights: MIT (same as [upstream](https://huggingface.co/ACE-Step/acestep-v15-xl-sft))
	- `text_encoder/` (Qwen3-Embedding-0.6B): Apache 2.0 - redistributed per Qwen's license

	## Citation

	```
	@misc{gong2026acestep,
	  title = {ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation},
	  author = {Junmin Gong and Yulin Song and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo},
	  howpublished = {\url{https://github.com/ace-step/ACE-Step-1.5}},
	  year = {2026},
	  note = {GitHub repository}
	}
	```