---
language:
- en
library_name: stable-audio-3
license: other
license_name: stable-audio-community
license_link: LICENSE
tags:
- music
- sound-effects
- audio
- autoencoder
---

# SAME: A Semantically-Aligned Music Autoencoder

Please note: For commercial use, please refer to [https://stability.ai/license](https://stability.ai/license).

## Model Description
Latent representations are at the heart of most modern generative models.
In the audio domain, they are typically produced by a neural audio codec autoencoder.
In this work we introduce SAME (Semantically Aligned Music autoEncoder),
a transformer-based autoencoder for stereo music and general audio that reaches a 4096x temporal compression ratio (roughly twice the current standard)
while maintaining excellent reconstruction quality and strong downstream generative performance.
We achieve this by combining a set of semantic regularisation approaches with phase-aware reconstruction losses.
The architecture also reduces computational cost substantially, through both its high compression ratio and its reliance on well-optimised transformer primitives.
Two variants (a large SAME-L and a CPU-deployable SAME-S) are released as open weights.

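To make the 4096x temporal compression ratio concrete, the sketch below computes how many latent frames a clip would occupy and how many latent frames correspond to one second of audio. The 44.1 kHz sample rate and the helper names are assumptions for illustration, not values taken from the model configuration. Rounding a final partial frame up is likewise an assumption about how padding is handled.

```python
COMPRESSION_RATIO = 4096  # temporal compression ratio stated in the description


def latent_frames(num_samples: int, ratio: int = COMPRESSION_RATIO) -> int:
    """Latent frames for a clip, rounding a final partial frame up (assumed)."""
    return (num_samples + ratio - 1) // ratio


def latent_frame_rate(sample_rate: int, ratio: int = COMPRESSION_RATIO) -> float:
    """Latent frames per second of audio."""
    return sample_rate / ratio


# Assumed sample rate, for illustration only
sample_rate = 44_100
print(latent_frame_rate(sample_rate))   # about 10.77 latent frames per second
print(latent_frames(30 * sample_rate))  # latent frames for a 30-second clip
```

Under these assumptions, a 30-second clip of 1,323,000 samples per channel maps to a sequence of only a few hundred latent frames, which is where much of the downstream compute saving would come from.
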
## Usage

This model can be used with:
1. the [`stable-audio-3`](https://github.com/Stability-AI/stable-audio-3) inference and fine-tuning library
2. the [`stable-audio-tools`](https://github.com/Stability-AI/stable-audio-tools) research library

### Using with `stable-audio-3`
```python
import torchaudio
from stable_audio_3 import AutoencoderModel

# Load the pretrained SAME-S autoencoder
ae = AutoencoderModel.from_pretrained("same-s")
# Load audio as a [channels, samples] tensor together with its sample rate
waveform, sr = torchaudio.load("audio.wav")
# Encode to latents, then decode back to audio
latents = ae.encode(waveform, sr)
audio_out = ae.decode(latents)
```

### Using with `stable-audio-tools`

```python
import torch
import torchaudio
from stable_audio_tools import get_pretrained_model

device = "cuda" if torch.cuda.is_available() else "cpu"
# Use half precision only when running on a GPU
model_half = device == "cuda"

# Download model
model, model_config = get_pretrained_model("stabilityai/SAME-S")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]

model = model.to(device)
if model_half:
    model = model.to(torch.float16)

# This example assumes sr == sample_rate; resample beforehand if it does not
audio, sr = torchaudio.load("/path/to/audiofile")  # [channels, samples]
# Duplicate a mono signal to obtain stereo input
if audio.shape[0] == 1:
    audio = audio.repeat(2, 1)

audio = audio.unsqueeze(0).to(device)  # [1, channels, samples]
if model_half:
    audio = audio.half()

with torch.no_grad():
    latents = model.encode_audio(audio)
    reconstructed = model.decode_audio(latents)

reconstructed = reconstructed.squeeze(0).cpu()
# Convert to 16-bit PCM
reconstructed = reconstructed.to(torch.float32).clamp(-1, 1).mul(32767).to(torch.int16)
```

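The final line of the example converts floating-point samples to 16-bit PCM. The same arithmetic is sketched below in plain Python on nested lists, together with writing the result using the standard-library `wave` module. The helper names `float_to_pcm16` and `write_wav` are hypothetical and not part of either library. In practice a library call such as `torchaudio.save` would normally be used instead.

```python
import struct
import wave


def float_to_pcm16(sample: float) -> int:
    """Clamp to [-1, 1], scale by 32767 and truncate toward zero."""
    clamped = max(-1.0, min(1.0, sample))
    return int(clamped * 32767)


def write_wav(path: str, channels: list[list[float]], sample_rate: int) -> None:
    """Write [channels][samples] float audio as an interleaved 16-bit PCM WAV file."""
    num_frames = len(channels[0])
    frames = bytearray()
    for i in range(num_frames):
        for ch in channels:
            # Little-endian signed 16-bit, one value per channel per frame
            frames += struct.pack("<h", float_to_pcm16(ch[i]))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(len(channels))
        wav.setsampwidth(2)  # 2 bytes = 16 bits
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
```

Scaling by 32767 rather than 32768 keeps the positive and negative extremes symmetric, so a clamped value of 1.0 maps to the largest positive 16-bit sample without overflow.
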
## Model Details
* **Model type**: `SAME` is a continuous autoencoder model based on a transformer architecture.
* **Language(s)**: English
* **License**: [Stability AI Community License](https://stability.ai/license).
* **Research Paper**: [https://arxiv.org/abs/2605.18613](https://arxiv.org/abs/2605.18613)

## Training dataset

### Datasets Used
Our dataset consists of ~19,500 hours of licensed production audio from [AudioSparx](https://www.audiosparx.com/), split roughly 66/25/9% between music, sound effects, and instrument stems.

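As a worked example of that split, the approximate hours per category follow directly from the stated total and percentages. The figures below are estimates derived from those rounded numbers, not separately reported statistics.

```python
TOTAL_HOURS = 19_500  # approximate total stated above
MIX_PERCENT = {"music": 66, "sound effects": 25, "instrument stems": 9}

# Integer arithmetic avoids floating-point rounding noise
hours_per_category = {
    name: TOTAL_HOURS * pct // 100 for name, pct in MIX_PERCENT.items()
}

for name, hours in hours_per_category.items():
    print(f"{name}: ~{hours:,} hours")
# music: ~12,870 hours
# sound effects: ~4,875 hours
# instrument stems: ~1,755 hours
```
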