Stable Audio 3 Medium

Please note: For commercial use, please refer to https://stability.ai/license

Model Description

Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Because our models can generate several minutes of audio, variable-length generation is key to avoiding the cost of producing full-length outputs for short sounds. We also support inpainting, which enables targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent. Finally, we run adversarial post-training to reduce the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data and generate music and sounds in less than 2 s on an H200 GPU and in a few seconds on a MacBook Pro M4. We release the weights of the small and medium models, which can run on consumer-grade hardware, together with their training and inference pipelines.

Usage

This model can be used with:

  1. the stable-audio-3 inference and fine-tuning library
  2. the stable-audio-tools research library

Using with stable-audio-3

from stable_audio_3 import StableAudioModel

model = StableAudioModel.from_pretrained("medium")
audio = model.generate(
    prompt=(
        "House music that encapsulates the feeling of being at a festival "
        "in the sunny weather with all your friends 124 BPM"
    ),
    duration=180
)

Using with stable-audio-tools

import torch
import torchaudio
from einops import rearrange
from stable_audio_tools import get_pretrained_model
from stable_audio_tools.inference.generation import generate_diffusion_cond_inpaint

device = "cuda" if torch.cuda.is_available() else "cpu"
# Use half precision only on GPU; define the flag on every path so the later check never raises NameError
model_half = device == "cuda"

# Download model
model, model_config = get_pretrained_model("stabilityai/stable-audio-3-medium")
sample_rate = model_config["sample_rate"]
sample_size = model_config["sample_size"]

model = model.to(device)
if model_half:
    model = model.to(torch.float16)

# Set up text and timing conditioning
conditioning = [{
    "prompt": (
        "A dream-like Synthpop instrumental that would accompany "
        "a dream-sequence in a surrealist movie 120 BPM"
    ),
    "seconds_total": 380
}]

# Generate stereo audio
output = generate_diffusion_cond_inpaint(
    model,
    steps=8,
    cfg_scale=1.0,
    conditioning=conditioning,
    sample_size=sample_size,
    sampler_type="pingpong",
    device=device
)

# Rearrange audio batch to a single sequence
output = rearrange(output, "b d n -> d (b n)")

# Peak normalize, clip, convert to int16, and save to file
output = output.to(torch.float32).div(torch.max(torch.abs(output))).clamp(-1, 1).mul(32767).to(torch.int16).cpu()
torchaudio.save("output.wav", output, sample_rate)
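The final post-processing line above chains four steps: peak normalization, clipping, scaling to the int16 range, and an integer cast. A minimal sketch of the same arithmetic on a plain Python list is shown here, so each step can be read separately. The helper name `peak_normalize_to_int16` is hypothetical and is not part of stable-audio-tools. The guard for silent input is an addition: dividing an all-zero signal by its peak is undefined, so the sketch returns zeros in that case.

```python
def peak_normalize_to_int16(samples):
    """Scale samples so the loudest one reaches full scale, then quantize to int16.

    Hypothetical helper for illustration; it mirrors the tensor one-liner above
    but works on a plain list of floats.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Silent (or empty) input: there is no peak to divide by, so return zeros.
        return [0] * len(samples)

    out = []
    for s in samples:
        scaled = s / peak                       # peak normalize into [-1.0, 1.0]
        clipped = max(-1.0, min(1.0, scaled))   # clip any rounding overshoot
        out.append(int(clipped * 32767))        # scale to int16; int() truncates toward zero
    return out


print(peak_normalize_to_int16([0.5, -0.25, 0.0]))
```

In the tensor version, the same guard could be expressed by checking that the peak is non-zero before dividing.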

Model Details

We use a publicly available pre-trained T5Gemma model (t5gemma-b-b-ul2) for text conditioning. T5Gemma is redistributed under the Gemma Terms of Use.

Training dataset

Datasets Used

Our dataset consists of 1,278,902 audio recordings, of which 806,284 are licensed from AudioSparx and 472,618 are from Freesound. The Freesound portion consists of recordings licensed under CC-0, CC-BY, or CC-Sampling+. To ensure no copyrighted content was present in the Freesound data, music recordings were identified using the PANNs [89] tagger. We flagged audio that activated music-related tags for at least 30 s (threshold of 0.15) and sent it to a trusted content detection company to verify the absence of copyrighted material. All identified copyrighted content was removed. After filtering, the Freesound portion includes 266,324 CC-0, 194,840 CC-BY, and 11,454 CC-Sampling+ recordings. This is the same subset of Freesound audio we used to train Stable Audio Open: https://info.stability.ai/attributions.
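The flagging rule described above (music-related tags active for at least 30 s at a threshold of 0.15) can be sketched as follows. This is an illustration under stated assumptions, not the actual filtering code. It assumes one music score per fixed hop (here 1 s), that the score is already the strongest music-related tag probability for that hop, and that active time is counted cumulatively rather than as one contiguous run. The function name and the `hop_seconds` parameter are hypothetical.

```python
def is_music_candidate(music_scores, threshold=0.15, min_seconds=30.0, hop_seconds=1.0):
    """Return True if music-related tags are active for at least min_seconds in total.

    Assumptions (not from the source): music_scores holds one tagger probability
    per hop of hop_seconds, and active hops are summed across the whole clip.
    """
    active_seconds = sum(hop_seconds for score in music_scores if score >= threshold)
    return active_seconds >= min_seconds


# 30 hops above threshold: flagged for review
print(is_music_candidate([0.2] * 30))
# 29 hops above threshold followed by quiet hops: not flagged
print(is_music_candidate([0.2] * 29 + [0.1] * 100))
```

Recordings for which such a check returns True would be the ones forwarded for copyright verification.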

Safetensors: model size 2B params, tensor type F32