Stable Audio 3 Small Music
Note: This is the base (pre-trained) model intended for fine-tuning. If you are looking to generate audio directly, please use Stable Audio 3 Small Music instead.
Please note: For commercial use, please refer to https://stability.ai/license
Model Description
Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Since our models can generate several minutes of audio, variable-length generation is key to avoiding the cost of producing full-length generations for short sounds. We also support inpainting, enabling targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent. Finally, we run adversarial post-training to reduce the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data to generate music and sounds in less than 2 s on an H200 GPU and in a few seconds on a MacBook Pro M4. We release the weights of the small and medium models, which can run on consumer-grade hardware, together with their training and inference pipeline.
Usage
This model can be used with:
- the stable-audio-3 inference and fine-tuning library
- the stable-audio-tools research library
Using with stable-audio-3
from stable_audio_3 import StableAudioModel
model = StableAudioModel.from_pretrained("small-music")
audio = model.generate(
    prompt=(
        "House music that encapsulates the feeling of being at a festival "
        "in the sunny weather with all your friends 124 BPM"
    ),
    duration=120,
    steps=50,
    cfg_scale=7.0
)
Using with stable-audio-tools
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond_inpaint
device = "cuda" if torch.cuda.is_available() else "cpu"
# Use half precision only on GPU; always defined, so the check below cannot raise NameError on CPU
model_half = device == "cuda"
# Download model
model, model_config = get_pretrained_model("stabilityai/stable-audio-3-small-music")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]
model = model.to(device)
if model_half:
    model = model.to(torch.float16)
# Set up text and timing conditioning
conditioning = [{
    "prompt": (
        "A dream-like Synthpop instrumental that would accompany "
        "a dream-sequence in a surrealist movie 120 BPM"
    ),
    "seconds_total": 120
}]
# Generate stereo audio
output = generate_diffusion_cond_inpaint(
    model,
    steps=50,
    cfg_scale=7.0,
    conditioning=conditioning,
    sample_size=sample_size,
    sampler_type="euler",
    device=device
)
# Rearrange audio batch to a single sequence
output = rearrange(output, "b d n -> d (b n)")
# Peak normalize, clip, convert to int16, and save to file
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)
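The peak-normalize and int16 conversion step above divides by the peak amplitude, which is zero if the generated audio happens to be entirely silent. A minimal sketch of the same step as a reusable helper with a guard for that case follows. The function name `to_int16_pcm` and the `eps` floor are assumptions made for illustration and are not part of stable-audio-tools.

```python
import torch


def to_int16_pcm(audio: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Peak-normalize a float waveform and convert it to int16 PCM.

    Hypothetical helper: mirrors the one-line conversion in the example
    above, but floors the peak at `eps` so silent audio does not
    produce a division by zero.
    """
    audio = audio.to(torch.float32)
    peak = audio.abs().max()
    # Floor the peak so an all-zero waveform stays all-zero instead of NaN
    audio = audio / peak.clamp(min=eps)
    # Clip to [-1, 1], scale to the int16 range, and move to CPU for saving
    return audio.clamp(-1, 1).mul(32767).to(torch.int16).cpu()
```

Under these assumptions, the result can be passed to `torchaudio.save` in place of the `output` tensor from the one-line conversion.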
Model Details
- Model type: Stable Audio 3 is a latent diffusion model based on a transformer architecture.
- Language(s): English
- License: Stability AI Community License.
- Research Paper: https://arxiv.org/abs/2605.17991
We use a publicly available pre-trained T5Gemma model (t5gemma-b-b-ul2) for text conditioning. T5Gemma is redistributed under the Gemma Terms of Use.
Training dataset
Datasets Used
Our dataset consists of 1,278,902 audio recordings: 806,284 are licensed from AudioSparx and a further 472,618 come from Freesound. The Freesound portion consists of recordings licensed under CC-0, CC-BY, or CC-Sampling+. To ensure no copyrighted content was present in the Freesound data, we identified music recordings using the PANNs [89] tagger. Audio that activated music-related tags for at least 30 s (threshold of 0.15) was flagged and sent to a trusted content detection company to verify the absence of copyrighted material. All identified copyrighted content was removed. After filtering, the Freesound portion includes 266,324 CC-0, 194,840 CC-BY, and 11,454 CC-Sampling+ recordings. This is the same subset of Freesound audio we used to train Stable Audio Open; see https://info.stability.ai/attributions.
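As a quick consistency check on the figures above, the following sketch confirms that the per-license Freesound counts sum to the Freesound total and that the two sources sum to the overall dataset size. The variable names are illustrative; the numbers are taken directly from the paragraph above.

```python
# Recording counts as stated in the dataset description
total_recordings = 1_278_902
audiosparx = 806_284
freesound_by_license = {
    "CC-0": 266_324,
    "CC-BY": 194_840,
    "CC-Sampling+": 11_454,
}

# The per-license counts should add up to the Freesound portion
freesound = sum(freesound_by_license.values())
assert freesound == 472_618

# The two sources should add up to the full dataset
assert audiosparx + freesound == total_recordings

# Share of each source in the full dataset, as a percentage
shares = {
    "AudioSparx": 100 * audiosparx / total_recordings,
    "Freesound": 100 * freesound / total_recordings,
}
for source, pct in shares.items():
    print(f"{source}: {pct:.1f}%")
```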