ACE-Step v1.5 1D VAE

Stable Audio Tools Format

GitHub | Project | Hugging Face | Space Demo | Discord | Tech Report

Model Details

This is the 1D Variational Autoencoder (VAE) used in ACE-Step v1.5 for music generation. The weights are provided in stable-audio-tools compatible format, making it easy to load, fine-tune, and integrate into your own training pipelines.

  • Developed by: ACE-STEP
  • Model type: Audio VAE (Oobleck Autoencoder)
  • License: MIT
| Parameter | Value |
| --- | --- |
| Architecture | Oobleck Autoencoder (VAE) |
| Audio Channels | 2 (stereo) |
| Sampling Rate | 48,000 Hz |
| Latent Dim | 64 |
| Encoder Latent Dim | 128 |
| Downsampling Ratio | 1,920 |
| Encoder/Decoder Channels | 128 |
| Channel Multipliers | [1, 2, 4, 8, 16] |
| Strides | [2, 4, 4, 6, 10] |
| Activation | Snake |
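As a quick sanity check (our own arithmetic, not part of the released code), the per-stage strides multiply out to the overall downsampling ratio, which in turn fixes the latent frame rate at 48 kHz:

```python
from math import prod

strides = [2, 4, 4, 6, 10]
downsample = prod(strides)        # overall encoder stride
print(downsample)                 # 1920

latent_rate = 48_000 / downsample # latent frames per second of audio
print(latent_rate)                # 25.0
```

So each second of 48 kHz stereo audio maps to 25 latent frames of dimension 64.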

πŸ—οΈ Architecture

The VAE is a core component of the ACE-Step v1.5 pipeline. It compresses raw 48 kHz stereo audio into a compact latent representation with a 1,920x downsampling ratio and a 64-dimensional latent space; the diffusion transformer (DiT) then operates in this latent space to generate music.
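The Snake activation listed in the table is the periodic activation x + sin²(αx)/α (Ziyin et al., 2020), commonly used in neural audio codecs for its ability to model periodic signals. A minimal NumPy sketch of the function itself:

```python
import numpy as np

def snake(x, alpha=1.0):
    """Snake activation: x + sin^2(alpha * x) / alpha."""
    return x + np.sin(alpha * x) ** 2 / alpha

# The identity component dominates for large |x|; the sine term adds
# a periodic ripple that helps model oscillatory (audio) signals.
print(snake(0.0))          # 0.0
```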

Quick Start

Installation

pip install stable-audio-tools torchaudio

Load and Use

from stable_audio_vae import StableAudioVAE

# Load model
vae = StableAudioVAE(
    config_path="config.json",
    checkpoint_path="checkpoint.ckpt",
)
vae = vae.cuda().eval()

# Encode audio
wav = vae.load_wav("input.wav")
wav = wav.cuda()
latent = vae.encode(wav)
print(f"Latent shape: {latent.shape}")  # [batch, 64, time/1920]

# Decode back to audio
output = vae.decode(latent)
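The encoder strides divide evenly only when the input length is a multiple of 1,920 samples; whether the wrapper pads internally is not documented here. A hedged helper (the name `pad_to_multiple` is our own, shown with NumPy; the same logic applies to torch tensors) for right-padding before encoding:

```python
import numpy as np

DOWNSAMPLE = 1920  # overall stride of the Oobleck encoder

def pad_to_multiple(wav: np.ndarray, multiple: int = DOWNSAMPLE) -> np.ndarray:
    """Right-pad the last (time) axis with zeros up to a multiple of `multiple`."""
    remainder = wav.shape[-1] % multiple
    if remainder == 0:
        return wav
    pad = multiple - remainder
    return np.pad(wav, [(0, 0)] * (wav.ndim - 1) + [(0, pad)])

wav = np.zeros((2, 47_999))            # stereo clip, one sample short of 1 s
print(pad_to_multiple(wav).shape)      # (2, 48000)
```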

Command Line

python stable_audio_vae.py -i input.wav -o output.wav

# For long audio, use chunked processing
python stable_audio_vae.py -i input.wav -o output.wav --chunked
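The exact strategy behind the `--chunked` flag is not documented here; as a rough illustration of one common approach, long audio can be split into fixed-size chunks aligned to the 1,920-sample downsampling ratio so each chunk maps to a whole number of latent frames (the helper name and chunk size below are our own):

```python
DOWNSAMPLE = 1920

def chunk_bounds(n_samples: int, chunk_frames: int = 500):
    """Yield (start, end) sample indices; each chunk covers chunk_frames latents."""
    chunk = chunk_frames * DOWNSAMPLE   # 960,000 samples = 20 s at 48 kHz
    for start in range(0, n_samples, chunk):
        yield start, min(start + chunk, n_samples)

bounds = list(chunk_bounds(48_000 * 60))  # one minute of 48 kHz audio
print(len(bounds))                        # 3
```

A production implementation would typically also overlap chunks and cross-fade at the boundaries to avoid seams in the decoded audio.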

Fine-Tuning

This checkpoint is compatible with stable-audio-tools training pipelines. The config.json includes full training configuration (optimizer, loss, discriminator settings) that you can use as a starting point for fine-tuning.
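The loss configuration in config.json is authoritative; purely as a refresher on the VAE objective (this is the generic diagonal-Gaussian KL term, not code from stable-audio-tools), the regularizer on the posterior looks like:

```python
import numpy as np

def kl_diag_gaussian(mu: np.ndarray, logvar: np.ndarray) -> float:
    """KL(N(mu, exp(logvar)) || N(0, I)), summed over dims, averaged over batch."""
    kl = 0.5 * (np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return float(kl.sum(axis=-1).mean())

# A posterior equal to the standard-normal prior has zero KL.
print(kl_diag_gaussian(np.zeros((4, 64)), np.zeros((4, 64))))  # 0.0
```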

File Structure

.
β”œβ”€β”€ config.json            # Model architecture and training config
β”œβ”€β”€ checkpoint.ckpt        # Model weights (PyTorch checkpoint)
β”œβ”€β”€ stable_audio_vae.py    # Inference script with StableAudioVAE wrapper
└── README.md

🦁 Related Models

| Model | Description | Hugging Face |
| --- | --- | --- |
| acestep-v15-base | DiT base model (CFG, 50 steps) | Link |
| acestep-v15-sft | DiT SFT model (CFG, 50 steps) | Link |
| acestep-v15-turbo | DiT turbo model (8 steps) | Link |
| acestep-v15-xl-base | XL DiT base (4B, CFG, 50 steps) | Link |
| acestep-v15-xl-sft | XL DiT SFT (4B, CFG, 50 steps) | Link |
| acestep-v15-xl-turbo | XL DiT turbo (4B, 8 steps) | Link |

πŸ™ Acknowledgements

This project is co-led by ACE Studio and StepFun.

πŸ“– Citation

If you find this project useful for your research, please consider citing:

@misc{gong2026acestep,
    title={ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation},
    author={Junmin Gong and Yulin Song and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo},
    howpublished={\url{https://github.com/ace-step/ACE-Step-1.5}},
    year={2026},
    note={GitHub repository}
}