ACE-Step v1.5 1D VAE
Stable Audio Tools Format
GitHub | Project | Hugging Face | Space Demo | Discord | Tech Report
Model Details
This is the 1D Variational Autoencoder (VAE) used in ACE-Step v1.5 for music generation. The weights are provided in stable-audio-tools compatible format, making it easy to load, fine-tune, and integrate into your own training pipelines.
| Parameter | Value |
|---|---|
| Architecture | Oobleck Autoencoder (VAE) |
| Audio Channels | 2 (Stereo) |
| Sampling Rate | 48,000 Hz |
| Latent Dim | 64 |
| Encoder Latent Dim | 128 |
| Downsampling Ratio | 1,920 |
| Encoder/Decoder Channels | 128 |
| Channel Multipliers | [1, 2, 4, 8, 16] |
| Strides | [2, 4, 4, 6, 10] |
| Activation | Snake |
Architecture
The VAE is a core component of the ACE-Step v1.5 pipeline. It compresses raw 48 kHz stereo audio into a compact 64-dimensional latent sequence at a 1920x temporal downsampling ratio, and the Diffusion Transformer (DiT) generates music directly in this latent space.
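As a quick sanity check, the 1920x ratio follows directly from the stride schedule in the table above (a short sketch using only the numbers listed there):

```python
import math

# Per-stage strides from the model config (see the table above)
strides = [2, 4, 4, 6, 10]

# Total temporal downsampling is the product of the per-stage strides
downsampling = math.prod(strides)
print(downsampling)  # 1920

# At 48 kHz input, each latent frame covers 1920 samples,
# so the latent sequence runs at 25 frames per second
latent_fps = 48_000 / downsampling
print(latent_fps)  # 25.0

# e.g. a 30-second clip maps to a [batch, 64, 750] latent tensor
num_frames = int(30 * latent_fps)
print(num_frames)  # 750
```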
Quick Start
Installation
pip install stable-audio-tools torchaudio
Load and Use
from stable_audio_vae import StableAudioVAE
# Load model
vae = StableAudioVAE(
config_path="config.json",
checkpoint_path="checkpoint.ckpt",
)
vae = vae.cuda().eval()
# Encode audio
wav = vae.load_wav("input.wav")
wav = wav.cuda()
latent = vae.encode(wav)
print(f"Latent shape: {latent.shape}") # [batch, 64, time/1920]
# Decode back to audio
output = vae.decode(latent)
Command Line
python stable_audio_vae.py -i input.wav -o output.wav
# For long audio, use chunked processing
python stable_audio_vae.py -i input.wav -o output.wav --chunked
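Chunked processing splits a long waveform into overlapping windows so memory stays bounded; aligning window lengths to multiples of the 1920-sample downsampling ratio keeps chunk boundaries on latent-frame boundaries. The sketch below illustrates that boundary math only; the chunk and overlap sizes are hypothetical, not the actual defaults of stable_audio_vae.py:

```python
def chunk_bounds(num_samples: int,
                 chunk_samples: int = 1920 * 1000,
                 overlap_samples: int = 1920 * 25) -> list[tuple[int, int]]:
    """Split a waveform into overlapping (start, end) sample ranges.

    Both sizes are multiples of 1920 so every chunk boundary lands
    exactly on a latent frame. Values are illustrative defaults.
    """
    assert chunk_samples % 1920 == 0 and overlap_samples % 1920 == 0
    bounds = []
    start = 0
    step = chunk_samples - overlap_samples
    while start < num_samples:
        end = min(start + chunk_samples, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break
        start += step
    return bounds

# A 5-minute 48 kHz file (14,400,000 samples) yields 8 overlapping chunks
print(len(chunk_bounds(5 * 60 * 48_000)))  # 8
```

Each chunk can then be encoded and decoded independently, with the overlapping regions cross-faded on reassembly to hide boundary artifacts.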
Fine-Tuning
This checkpoint is compatible with stable-audio-tools training pipelines. The config.json includes full training configuration (optimizer, loss, discriminator settings) that you can use as a starting point for fine-tuning.
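A typical workflow is to load the bundled config, adjust the training section, and write out a fine-tuning config. The sketch below is hedged: the key names are assumptions about a typical stable-audio-tools config layout, so check them against the actual config.json before use:

```python
import json

# Illustrative config shape; the real config.json may use different keys
config = {
    "model_type": "autoencoder",
    "sample_rate": 48000,
    "training": {"learning_rate": 1.5e-4, "warmup_steps": 0},
}

# Lower the learning rate when resuming from the released checkpoint
config["training"]["learning_rate"] = 1e-5

with open("finetune_config.json", "w") as f:
    json.dump(config, f, indent=2)

print(json.load(open("finetune_config.json"))["training"]["learning_rate"])  # 1e-05
```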
File Structure
.
├── config.json           # Model architecture and training config
├── checkpoint.ckpt       # Model weights (PyTorch checkpoint)
├── stable_audio_vae.py   # Inference script with StableAudioVAE wrapper
└── README.md
Related Models
| Model | Description | Hugging Face |
|---|---|---|
| acestep-v15-base | DiT base model (CFG, 50 steps) | Link |
| acestep-v15-sft | DiT SFT model (CFG, 50 steps) | Link |
| acestep-v15-turbo | DiT turbo model (8 steps) | Link |
| acestep-v15-xl-base | XL DiT base (4B, CFG, 50 steps) | Link |
| acestep-v15-xl-sft | XL DiT SFT (4B, CFG, 50 steps) | Link |
| acestep-v15-xl-turbo | XL DiT turbo (4B, 8 steps) | Link |
Acknowledgements
This project is co-led by ACE Studio and StepFun.
Citation
If you find this project useful for your research, please consider citing:
@misc{gong2026acestep,
  title={ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation},
  author={Junmin Gong and Yulin Song and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo},
  howpublished={\url{https://github.com/ace-step/ACE-Step-1.5}},
  year={2026},
  note={GitHub repository}
}