Stable Audio 3 Small Music
Please note: For commercial use, please refer to https://stability.ai/license
Model Description
Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Since our models can generate several minutes of audio, variable-length generations are key to avoiding the cost of producing full-length generations for short sounds. We also support inpainting, enabling targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent. Finally, we run adversarial post-training to both accelerate inference and improve generation quality, reducing the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data to generate music and sounds in less than 2 s on an H200 GPU and in less than a few seconds on a MacBook Pro M4. We release the weights of small and medium, which can run on consumer-grade hardware, together with their training and inference pipeline.
Usage
This model can be used with:
- the `stable-audio-3` inference and fine-tuning library
- the `stable-audio-tools` research library
### Using with `stable-audio-3`
from stable_audio_3 import StableAudioModel

model = StableAudioModel.from_pretrained("small-music")
audio = model.generate(
    prompt=(
        "House music that encapsulates the feeling of being at a festival "
        "in the sunny weather with all your friends 124 BPM"
    ),
    duration=120
)
### Using with `stable-audio-tools`
import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond_inpaint
device = "cuda" if torch.cuda.is_available() else "cpu"
# Use half precision only when running on a GPU
model_half = device == "cuda"
# Download model
model, model_config = get_pretrained_model("stabilityai/stable-audio-3-small-music")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]
model = model.to(device)
if model_half:
    model = model.to(torch.float16)
# Set up text and timing conditioning
conditioning = [{
    "prompt": (
        "A dream-like Synthpop instrumental that would accompany "
        "a dream-sequence in a surrealist movie 120 BPM"
    ),
    "seconds_total": 120
}]
# Generate stereo audio
output = generate_diffusion_cond_inpaint(
    model,
    steps=8,
    cfg_scale=1.0,
    conditioning=conditioning,
    sample_size=sample_size,
    sampler_type="pingpong",
    device=device
)
# Rearrange audio batch to a single sequence
output = rearrange(output, "b d n -> d (b n)")
# Peak normalize, clip, convert to int16, and save to file
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)
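The peak-normalize, clip, and int16-conversion step above divides by the peak amplitude, which is zero when the output is silent. The arithmetic can be sketched on a plain Python list, with a guard for that case. This is a minimal illustration only, not part of either library: the function name `to_int16_pcm` and the `eps` parameter are hypothetical.

```python
def to_int16_pcm(samples, eps=1e-8):
    """Peak-normalize float samples and convert them to 16-bit integer values.

    Hypothetical helper: mirrors the div / clamp / mul(32767) chain above,
    but returns zeros instead of dividing by zero on silent input.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak < eps:
        # Silent or empty input: nothing to scale.
        return [0 for _ in samples]
    pcm = []
    for s in samples:
        scaled = max(-1.0, min(1.0, s / peak))  # normalize, then clip to [-1, 1]
        pcm.append(int(scaled * 32767))         # truncate toward zero, as a float-to-int cast does
    return pcm


print(to_int16_pcm([0.0, 0.25, -0.5]))  # [0, 16383, -32767]
```

In the tensor version, the same guard could be expressed by clamping the peak to a small positive minimum before dividing.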
Model Details
- Model type: `Stable Audio 3` is a latent diffusion model based on a transformer architecture.
- Language(s): English
- License: Stability AI Community License.
- Research Paper: https://arxiv.org/abs/2605.17991
We use a publicly available pre-trained T5Gemma model (t5gemma-b-b-ul2) for text conditioning. T5Gemma is redistributed under the Gemma Terms of Use.
Training dataset
Datasets Used
Our dataset consists of 1,278,902 audio recordings: 806,284 are licensed from AudioSparx and a further 472,618 are from Freesound. The Freesound portion consists of recordings licensed under CC-0, CC-BY, or CC-Sampling+. To ensure no copyrighted content was present in the Freesound data, music recordings were identified using the PANNs [89] tagger. We flagged audio that activated music-related tags for at least 30 s (threshold of 0.15) and sent it to a trusted content detection company to verify the absence of copyrighted material. All identified copyrighted content was removed. After filtering, the Freesound portion includes 266,324 CC-0, 194,840 CC-BY, and 11,454 CC-Sampling+ recordings. This is the same subset of Freesound audio we used to train Stable Audio Open: https://info.stability.ai/attributions.
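The flagging rule described above (music-related tags active for at least 30 s at a threshold of 0.15) could be sketched as follows. This is an illustration under stated assumptions, not the pipeline actually used: it assumes the tagger yields one music probability per fixed-length frame, that the 30 s are counted cumulatively rather than contiguously, and that the comparison against the threshold is inclusive. The names `should_flag` and `frame_seconds` are hypothetical.

```python
MUSIC_THRESHOLD = 0.15     # activation threshold quoted in the text
MIN_MUSIC_SECONDS = 30.0   # minimum music duration quoted in the text


def should_flag(music_probs, frame_seconds=1.0):
    """Return True if music activations reach the threshold for >= 30 s in total.

    Assumptions (not confirmed by the source): per-frame probabilities,
    cumulative rather than contiguous duration, inclusive comparison.
    """
    active_frames = sum(1 for p in music_probs if p >= MUSIC_THRESHOLD)
    return active_frames * frame_seconds >= MIN_MUSIC_SECONDS


# 30 one-second frames above the threshold are enough to flag a recording.
print(should_flag([0.2] * 30))               # True
# 29 active frames are not, however long the rest of the recording is.
print(should_flag([0.2] * 29 + [0.1] * 100)) # False
```

Recordings flagged this way would then go on to the external verification step described above, rather than being removed outright.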
Model tree for stabilityai/stable-audio-3-small-music
Base model
stabilityai/stable-audio-3-small-music-base