Initial upload: ACE-Step v1.5 1D VAE (stable-audio-tools format)
- README.md +130 -0
- checkpoint.ckpt +3 -0
- config.json +123 -0
- stable_audio_vae.py +205 -0
README.md
ADDED
@@ -0,0 +1,130 @@
---
library_name: stable-audio-tools
license: mit
pipeline_tag: text-to-audio
tags:
- audio
- music
- vae
- autoencoder
- ace-step
- stable-audio-tools
---

<h1 align="center">ACE-Step v1.5 1D VAE</h1>
<h1 align="center">Stable Audio Tools Format</h1>
<p align="center">
  <a href="https://github.com/ACE-Step/ACE-Step-1.5">GitHub</a> |
  <a href="https://ace-step.github.io/ace-step-v1.5.github.io/">Project</a> |
  <a href="https://huggingface.co/collections/ACE-Step/ace-step-15">Hugging Face</a> |
  <a href="https://huggingface.co/spaces/ACE-Step/Ace-Step-v1.5">Space Demo</a> |
  <a href="https://discord.gg/PeWDxrkdj7">Discord</a> |
  <a href="https://arxiv.org/abs/2602.00744">Tech Report</a>
</p>

## Model Details

This is the 1D Variational Autoencoder (VAE) used in [ACE-Step v1.5](https://github.com/ACE-Step/ACE-Step-1.5) for music generation. The weights are provided in a **[stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools)**-compatible format, making the model easy to load, fine-tune, and integrate into your own training pipelines.

- **Developed by:** [ACE-Step](https://github.com/ACE-Step)
- **Model type:** Audio VAE (Oobleck autoencoder)
- **License:** [MIT](https://opensource.org/licenses/MIT)

| Parameter | Value |
|-----------|-------|
| Architecture | Oobleck autoencoder (VAE) |
| Audio Channels | 2 (stereo) |
| Sampling Rate | 48,000 Hz |
| Latent Dim | 64 |
| Encoder Latent Dim | 128 |
| Downsampling Ratio | 1,920 |
| Encoder/Decoder Channels | 128 |
| Channel Multipliers | [1, 2, 4, 8, 16] |
| Strides | [2, 4, 4, 6, 10] |
| Activation | Snake |

## 🏗️ Architecture

The VAE is a core component of the ACE-Step v1.5 pipeline: it compresses raw 48 kHz stereo audio into a compact latent representation with a 1920x temporal downsampling ratio and a 64-dimensional latent space (the encoder's 128 output channels parameterize the 64-channel VAE bottleneck). The diffusion transformer (DiT) then generates music in this latent space.
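
To make the 1920x figure concrete, the shape arithmetic works out as follows (a quick illustrative check, not part of the model API):

```python
# Latent geometry implied by the config: 48 kHz stereo in, 1920x temporal
# downsampling, 64 latent channels out.
sample_rate = 48_000
downsampling_ratio = 1_920

seconds = 10
n_samples = seconds * sample_rate                   # 480,000 samples per channel
n_latent_frames = n_samples // downsampling_ratio   # 250 latent frames

# A 10 s stereo clip [2, 480000] encodes to a latent of shape [64, 250],
# i.e. 25 latent frames per second of audio.
print(n_latent_frames)  # 250
```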

## Quick Start

### Installation

```bash
pip install stable-audio-tools torchaudio
```

### Load and Use

```python
from stable_audio_vae import StableAudioVAE

# Load model
vae = StableAudioVAE(
    config_path="config.json",
    checkpoint_path="checkpoint.ckpt",
)
vae = vae.cuda().eval()

# Encode audio
wav = vae.load_wav("input.wav")
wav = wav.cuda()
latent = vae.encode(wav)
print(f"Latent shape: {latent.shape}")  # [batch, 64, time/1920]

# Decode back to audio
output = vae.decode(latent)
```
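
For long inputs, the wrapper exposes the same chunked processing as the CLI's `--chunked` flag directly on `encode`/`decode`:

```python
# Chunked encode/decode for long audio; equivalent to passing --chunked on the CLI.
latent = vae.encode(wav, chunked=True)
output = vae.decode(latent, chunked=True)
```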

### Command Line

```bash
python stable_audio_vae.py -i input.wav -o output.wav

# For long audio, use chunked processing
python stable_audio_vae.py -i input.wav -o output.wav --chunked
```

## Fine-Tuning

This checkpoint is compatible with [stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools) training pipelines. The `config.json` includes the full training configuration (optimizer, loss, and discriminator settings) that you can use as a starting point for fine-tuning.
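
As a minimal warm-start sketch (it mirrors the checkpoint-loading logic in `stable_audio_vae.py`; the training loop itself is left to your stable-audio-tools setup), you can rebuild the autoencoder from `config.json` and initialize it from this checkpoint:

```python
# Instantiate the autoencoder from config.json and warm-start it from
# checkpoint.ckpt before fine-tuning. Mirrors the loading code in stable_audio_vae.py.
import json
import torch
from stable_audio_tools.models.autoencoders import create_autoencoder_from_config

with open("config.json") as f:
    config = json.load(f)
model = create_autoencoder_from_config(config)

state = torch.load("checkpoint.ckpt", map_location="cpu")
state = state.get("state_dict", state)
# Lightning-style checkpoints prefix autoencoder weights with "autoencoder."
if any(k.startswith("autoencoder.") for k in state):
    state = {k[len("autoencoder."):]: v for k, v in state.items()
             if k.startswith("autoencoder.")}
model.load_state_dict(state)
model.train()  # hand off to your stable-audio-tools training run
```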

## File Structure

```
.
├── config.json          # Model architecture and training config
├── checkpoint.ckpt      # Model weights (PyTorch checkpoint)
├── stable_audio_vae.py  # Inference script with the StableAudioVAE wrapper
└── README.md
```

## 🦁 Related Models

| Model | Description | Hugging Face |
|-------|-------------|--------------|
| `acestep-v15-base` | DiT base model (CFG, 50 steps) | [Link](https://huggingface.co/ACE-Step/acestep-v15-base) |
| `acestep-v15-sft` | DiT SFT model (CFG, 50 steps) | [Link](https://huggingface.co/ACE-Step/acestep-v15-sft) |
| `acestep-v15-turbo` | DiT turbo model (8 steps) | [Link](https://huggingface.co/ACE-Step/Ace-Step1.5) |
| `acestep-v15-xl-base` | XL DiT base (4B, CFG, 50 steps) | [Link](https://huggingface.co/ACE-Step/acestep-v15-xl-base) |
| `acestep-v15-xl-sft` | XL DiT SFT (4B, CFG, 50 steps) | [Link](https://huggingface.co/ACE-Step/acestep-v15-xl-sft) |
| `acestep-v15-xl-turbo` | XL DiT turbo (4B, 8 steps) | [Link](https://huggingface.co/ACE-Step/acestep-v15-xl-turbo) |

## 🙏 Acknowledgements

This project is co-led by ACE Studio and StepFun.

## 📖 Citation

If you find this project useful for your research, please consider citing:

```BibTeX
@misc{gong2026acestep,
      title={ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation},
      author={Junmin Gong and Yulin Song and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo},
      howpublished={\url{https://github.com/ace-step/ACE-Step-1.5}},
      year={2026},
      note={GitHub repository}
}
```
checkpoint.ckpt
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1575959a062145b8a36e4db420431d38748c82c7ba53ebe6742b073b9abf58b5
size 674902910
config.json
ADDED
@@ -0,0 +1,123 @@
{
    "model_type": "autoencoder",
    "sample_size": 122880,
    "sample_rate": 48000,
    "audio_channels": 2,
    "model": {
        "encoder": {
            "type": "oobleck",
            "config": {
                "in_channels": 2,
                "channels": 128,
                "c_mults": [1, 2, 4, 8, 16],
                "strides": [2, 4, 4, 6, 10],
                "latent_dim": 128,
                "use_snake": true
            }
        },
        "decoder": {
            "type": "oobleck",
            "config": {
                "out_channels": 2,
                "channels": 128,
                "c_mults": [1, 2, 4, 8, 16],
                "strides": [2, 4, 4, 6, 10],
                "latent_dim": 64,
                "use_snake": true,
                "final_tanh": false
            }
        },
        "bottleneck": {
            "type": "vae"
        },
        "latent_dim": 64,
        "downsampling_ratio": 1920,
        "io_channels": 2
    },
    "training": {
        "learning_rate": 1.5e-4,
        "warmup_steps": 0,
        "encoder_freeze_on_warmup": true,
        "use_ema": true,
        "optimizer_configs": {
            "autoencoder": {
                "optimizer": {
                    "type": "Muon",
                    "config": {
                        "betas": [0.8, 0.99],
                        "lr": 1.5e-4,
                        "weight_decay": 1e-3
                    }
                },
                "scheduler": {
                    "type": "InverseLR",
                    "config": {
                        "inv_gamma": 200000,
                        "power": 0.5,
                        "warmup": 0.999
                    }
                }
            },
            "discriminator": {
                "optimizer": {
                    "type": "Muon",
                    "config": {
                        "betas": [0.8, 0.99],
                        "lr": 3e-4,
                        "weight_decay": 1e-3
                    }
                },
                "scheduler": {
                    "type": "InverseLR",
                    "config": {
                        "inv_gamma": 200000,
                        "power": 0.5,
                        "warmup": 0.999
                    }
                }
            }
        },
        "loss_configs": {
            "discriminator": {
                "type": "encodec",
                "config": {
                    "filters": 64,
                    "n_ffts": [2048, 1024, 512, 256, 128],
                    "hop_lengths": [512, 256, 128, 64, 32],
                    "win_lengths": [2048, 1024, 512, 256, 128]
                },
                "weights": {
                    "adversarial": 0.5,
                    "feature_matching": 5.0
                }
            },
            "spectral": {
                "type": "mrstft",
                "config": {
                    "fft_sizes": [2048, 1024, 512, 256, 128, 64, 32],
                    "hop_sizes": [512, 256, 128, 64, 32, 16, 8],
                    "win_lengths": [2048, 1024, 512, 256, 128, 64, 32],
                    "perceptual_weighting": true
                },
                "weights": {
                    "mrstft": 1.0
                }
            },
            "time": {
                "type": "l1",
                "weights": {
                    "l1": 0.0
                }
            },
            "bottleneck": {
                "type": "kl",
                "weights": {
                    "kl": 0
                }
            }
        },
        "demo": {
            "demo_every": 2000
        }
    }
}
stable_audio_vae.py
ADDED
@@ -0,0 +1,205 @@
import os
import json
import torch
import torch.nn as nn
from torch.nn.utils import remove_weight_norm, weight_norm
import torchaudio

from stable_audio_tools.models.autoencoders import create_autoencoder_from_config


DEFAULT_ROOT = "./"
DEFAULT_CONFIG_PATH = os.path.join(DEFAULT_ROOT, "config.json")
DEFAULT_CHECKPOINT_PATH = os.path.join(DEFAULT_ROOT, "checkpoint.ckpt")


def remove_weight_norm_(module):
    """Recursively remove weight normalization from all submodules."""
    for name, child in module.named_children():
        if hasattr(child, "weight"):
            try:
                remove_weight_norm(child)
            except ValueError:
                pass
        remove_weight_norm_(child)


def add_weight_norm_(module):
    """Recursively add weight normalization to all submodules."""
    for name, child in module.named_children():
        if hasattr(child, "weight"):
            weight_norm(child)
        add_weight_norm_(child)


def prepare_audio(audio, in_sr, target_sr, target_length, target_channels, device):
    """Resample, pad/crop, and set audio channels."""
    audio = audio.to(device)

    if in_sr != target_sr:
        audio = torchaudio.functional.resample(
            audio, orig_freq=in_sr, new_freq=target_sr
        )
    if target_length is None:
        target_length = audio.shape[-1]
    audio = PadCrop(target_length, randomize=False)(audio)

    if audio.dim() == 1:
        audio = audio.unsqueeze(0).unsqueeze(0)
    elif audio.dim() == 2:
        audio = audio.unsqueeze(0)

    audio = set_audio_channels(audio, target_channels)
    return audio


class PadCrop(torch.nn.Module):
    def __init__(self, n_samples, randomize=True):
        super().__init__()
        self.n_samples = n_samples
        self.randomize = randomize

    def __call__(self, signal):
        n, s = signal.shape
        start = 0 if (not self.randomize) else torch.randint(
            0, max(0, s - self.n_samples) + 1, []
        ).item()
        end = start + self.n_samples
        output = signal.new_zeros([n, self.n_samples])
        output[:, :min(s, self.n_samples)] = signal[:, start:end]
        return output


def set_audio_channels(audio, target_channels):
    if target_channels == 1:
        audio = audio.mean(1, keepdim=True)
    elif target_channels == 2:
        if audio.shape[1] == 1:
            audio = audio.repeat(1, 2, 1)
        elif audio.shape[1] > 2:
            audio = audio[:, :2, :]
    return audio

class StableAudioVAE(nn.Module):
    def __init__(
        self,
        sampling_rate=48000,
        config_path=DEFAULT_CONFIG_PATH,
        checkpoint_path=DEFAULT_CHECKPOINT_PATH,
        scale_factor=1.0,
        shift_factor=0.0,
        remove_norm=False,
        overlap=32,
        chunk_size=128,
    ):
        super(StableAudioVAE, self).__init__()
        with open(config_path, "r") as f:
            self.config = json.load(f)
        self.vae = create_autoencoder_from_config(self.config)

        # Load checkpoint - support both .ckpt (PyTorch) and .safetensors
        if checkpoint_path.endswith(".safetensors"):
            from safetensors.torch import load_file
            checkpoints = load_file(checkpoint_path)
        else:
            checkpoints = torch.load(
                checkpoint_path, map_location=torch.device("cpu")
            )
            if "state_dict" in checkpoints:
                checkpoints = checkpoints["state_dict"]

        # Strip "autoencoder." prefix if present
        has_autoencoder = any(
            k.startswith("autoencoder.") for k in checkpoints.keys()
        )
        if has_autoencoder:
            checkpoints = {
                k.replace("autoencoder.", ""): v
                for k, v in checkpoints.items()
                if k.startswith("autoencoder.")
            }
        self.vae.load_state_dict(checkpoints)

        if remove_norm:
            remove_weight_norm_(self.vae)

        self.scale_factor = scale_factor
        self.shift_factor = shift_factor
        self.sampling_rate = sampling_rate
        self.io_channels = self.config["audio_channels"]
        self.overlap = overlap
        self.chunk_size = chunk_size
        self.downsampling_ratio = self.vae.downsampling_ratio
        self.latent_dim = self.vae.latent_dim

    def load_wav(self, path):
        wav, sr = torchaudio.load(path)
        wav = prepare_audio(
            wav,
            in_sr=sr,
            target_sr=self.sampling_rate,
            target_length=None,
            target_channels=self.io_channels,
            device="cpu",
        )
        return wav

    @torch.no_grad()
    def encode(self, wav, chunked=False):
        # wav is [batch, channels, samples]; fall back to single-shot encoding
        # when the clip is shorter than one chunk.
        if wav.shape[-1] <= self.chunk_size * self.vae.downsampling_ratio:
            chunked = False
        latent = self.vae.encode_audio(wav, chunked=chunked)
        latent = self.scale_factor * (latent - self.shift_factor)
        return latent

    @torch.no_grad()
    def decode(self, z, chunked=False):
        z = z / self.scale_factor + self.shift_factor
        if z.shape[-1] <= self.chunk_size:
            chunked = False
        output = self.vae.decode_audio(z, chunked=chunked)
        return output

    @torch.no_grad()
    def forward(self, wav, chunked=False):
        """Encode and decode audio (reconstruction)."""
        latent = self.vae.encode_audio(wav, chunked=chunked)
        latent = self.scale_factor * (latent - self.shift_factor)
        latent = latent / self.scale_factor + self.shift_factor
        output = self.vae.decode_audio(latent, chunked=chunked)
        return output

if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="Encode and decode audio with StableAudioVAE")
    parser.add_argument("-m", "--model", type=str, default=DEFAULT_CHECKPOINT_PATH, help="path to checkpoint")
    parser.add_argument("-c", "--config", type=str, default=DEFAULT_CONFIG_PATH, help="path to config.json")
    parser.add_argument("-i", "--input", type=str, required=True, help="input audio file")
    parser.add_argument("-o", "--output", type=str, required=True, help="output audio file")
    parser.add_argument("-sr", "--sampling_rate", type=int, default=48000, help="sampling rate")
    parser.add_argument("--chunked", action="store_true", help="use chunked processing for long audio")
    args = parser.parse_args()

    pipeline = StableAudioVAE(
        sampling_rate=args.sampling_rate,
        config_path=args.config,
        checkpoint_path=args.model,
    )
    pipeline = pipeline.cuda().eval()

    wav = pipeline.load_wav(args.input)
    wav = wav.cuda()
    print(f"Input shape: {wav.shape}")

    z = pipeline.encode(wav, chunked=args.chunked)
    print(f"Latent shape: {z.shape}")

    output = pipeline.decode(z, chunked=args.chunked)
    print(f"Output shape: {output.shape}")

    output = output[0].cpu()
    torchaudio.save(args.output, output, pipeline.sampling_rate)
    print(f"Saved to {args.output}")