Stable Audio 3 — bundled mirror

Self-contained inference bundle for the MAESTRO desktop app. One-to-one mirror of Stability AI's Stable Audio 3 collection and the extras collection (base checkpoints + standalone autoencoders), bundled into a single browseable HF repo so the MAESTRO panel can pick the variant a user wants without juggling eight separate downloads.

License — Stability AI Community License

All weights in this repository are released by Stability AI under the Stability AI Community License:

Free for organizations with under $1M annual revenue. Commercial use of the models and outputs is permitted within that threshold; redistribution, fine-tuning, and derivative works are explicitly allowed. Outputs are yours. Above the revenue threshold, contact Stability AI for an Enterprise License.

The upstream stable-audio-3 source code is released separately under MIT.

Gated subdirs

Three subdirs (small-music/, small-sfx/, and medium/) mirror upstream repos that are gated on huggingface.co. You must accept Stability AI's terms (and the Gemma terms of use, since the text encoder is T5-Gemma) before this mirror's gating allows access.

The base checkpoints and SAME autoencoders are open.
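
A minimal sketch of pulling a single subdir, assuming the gated subdirs are the three marked "(gated)" in the Contents table and that an access token is either stored by a prior Hugging Face login or exported as HF_TOKEN. The names `GATED_SUBDIRS`, `OPEN_SUBDIRS`, `is_gated`, and `download_variant` are illustrative and not part of MAESTRO or the upstream package.

```python
import os
from typing import Optional

REPO_ID = "AEmotionStudio/stable-audio-3-mirrors"  # repo id used in the Standalone example

# Subdirs whose upstream repos are marked "(gated)" in the Contents table.
GATED_SUBDIRS = {"small-music", "small-sfx", "medium"}
OPEN_SUBDIRS = {"small-music-base", "small-sfx-base", "medium-base", "same-s", "same-l"}


def is_gated(variant: str) -> bool:
    """Return True if the subdir mirrors a gated upstream repo."""
    if variant not in GATED_SUBDIRS | OPEN_SUBDIRS:
        raise ValueError(f"unknown subdir: {variant!r}")
    return variant in GATED_SUBDIRS


def download_variant(variant: str, token: Optional[str] = None) -> str:
    """Download one subdir and return the local path to it."""
    is_gated(variant)  # validates the subdir name before any network call
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    # token=None lets huggingface_hub fall back to a locally stored login token.
    token = token or os.environ.get("HF_TOKEN")
    local = snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=[f"{variant}/**"],
        token=token,
    )
    return os.path.join(local, variant)
```

Calling `download_variant("medium")` before the terms have been accepted is expected to fail with an access error from the Hub; `is_gated` is only a local hint for showing the user a more helpful message.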

Contents

| Subdir | Role | Params | Max duration | Upstream |
|---|---|---|---|---|
| small-music/ | Post-trained text → audio (music) | 433 M | 120 s | stabilityai/stable-audio-3-small-music (gated) |
| small-sfx/ | Post-trained text → audio (SFX) | 433 M | 120 s | stabilityai/stable-audio-3-small-sfx (gated) |
| medium/ | Post-trained text → audio (music + SFX) | 1.4 B | 380 s | stabilityai/stable-audio-3-medium (gated) |
| small-music-base/ | Base ckpt for LoRA fine-tuning | 433 M | 120 s | stabilityai/stable-audio-3-small-music-base |
| small-sfx-base/ | Base ckpt for LoRA fine-tuning | 433 M | 120 s | stabilityai/stable-audio-3-small-sfx-base |
| medium-base/ | Base ckpt for LoRA fine-tuning | 1.4 B | 380 s | stabilityai/stable-audio-3-medium-base |
| same-s/ | SAME-Small standalone autoencoder | ~50 M | | stabilityai/SAME-S |
| same-l/ | SAME-Large standalone autoencoder | ~200 M | | stabilityai/SAME-L |

Every subdir contains model.safetensors + model_config.json. The post-trained and base variants also bundle the T5-Gemma text encoder and the SAME pretransform; the SAME subdirs are autoencoder-only.
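
As a quick sanity check after a download, the layout described above can be verified locally. This is a sketch that only checks for the two file names stated in this README; `missing_files` is a hypothetical helper, not something shipped with MAESTRO.

```python
from pathlib import Path

# The two files every subdir is documented to contain.
REQUIRED_FILES = ("model.safetensors", "model_config.json")


def missing_files(subdir):
    """Return the required file names that are absent from a local subdir."""
    root = Path(subdir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list means the subdir has the expected layout; anything else usually points to an interrupted or filtered download.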

Capabilities

All six generative variants share a single inference surface in MAESTRO with four modes:

  • Text → Audio — prompt-only generation, stereo 44.1 kHz
  • Audio → Audio — style transfer / restyling with an adjustable init_noise_level
  • Inpaint — multi-region regeneration of a source clip; non-region time is preserved verbatim
  • Continue — extend an existing clip past its end
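
To make the Inpaint bullet concrete, here is a sketch of how a list of regions given in seconds might be normalised into sample ranges before regeneration. It assumes the 44.1 kHz rate from the Text → Audio bullet applies to the source clip; `regions_to_sample_ranges` is a hypothetical helper and does not reflect how MAESTRO or the upstream package represents regions internally.

```python
SAMPLE_RATE = 44_100  # Hz; assumed to match the 44.1 kHz output rate listed above


def regions_to_sample_ranges(regions, clip_seconds, sample_rate=SAMPLE_RATE):
    """Convert (start_s, end_s) regions into merged, clamped [start, end) sample ranges."""
    total = round(clip_seconds * sample_rate)
    ranges = []
    for start_s, end_s in sorted(regions):
        start = max(0, round(start_s * sample_rate))
        end = min(total, round(end_s * sample_rate))
        if end <= start:
            continue  # empty region, or entirely outside the clip
        if ranges and start <= ranges[-1][1]:
            # Overlaps or touches the previous range: merge the two.
            ranges[-1] = (ranges[-1][0], max(ranges[-1][1], end))
        else:
            ranges.append((start, end))
    return ranges
```

Everything outside the returned ranges is the "non-region time" that the Inpaint mode is described as preserving.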

Generation knobs exposed: prompt, negative prompt, duration, steps, CFG scale, APG scale, seed, batch size, sampler type (dpmpp-3m-sde / dpmpp-2m / euler / heun), distribution shift (logSNR / flux / identity), precision (fp16 / fp32), chunked decode, and a user-loadable LoRA stack.
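
The knobs above could be grouped and validated along these lines. This is a sketch covering a subset of the knobs: the class name, field names, and defaults are assumptions (duration, steps, and CFG scale borrow the values from the Standalone example; the rest are placeholders), and only the allowed sampler, shift, and precision values come from this README.

```python
from dataclasses import dataclass

SAMPLERS = ("dpmpp-3m-sde", "dpmpp-2m", "euler", "heun")
DIST_SHIFTS = ("logSNR", "flux", "identity")
PRECISIONS = ("fp16", "fp32")


@dataclass
class GenerationSettings:
    """Hypothetical container for a subset of the generation knobs."""
    prompt: str
    negative_prompt: str = ""
    duration: float = 10.0             # seconds
    steps: int = 8
    cfg_scale: float = 1.0
    seed: int = 0                      # placeholder default
    batch_size: int = 1                # placeholder default
    sampler_type: str = "dpmpp-3m-sde"  # placeholder default
    dist_shift: str = "logSNR"         # placeholder default
    precision: str = "fp16"            # placeholder default
    max_duration: float = 120.0        # 120 s for Small variants, 380 s for Medium

    def __post_init__(self):
        if self.sampler_type not in SAMPLERS:
            raise ValueError(f"sampler_type must be one of {SAMPLERS}")
        if self.dist_shift not in DIST_SHIFTS:
            raise ValueError(f"dist_shift must be one of {DIST_SHIFTS}")
        if self.precision not in PRECISIONS:
            raise ValueError(f"precision must be one of {PRECISIONS}")
        if not 0 < self.duration <= self.max_duration:
            raise ValueError(f"duration must be in (0, {self.max_duration}] seconds")
        if self.steps < 1 or self.batch_size < 1:
            raise ValueError("steps and batch_size must be at least 1")
```

Validating up front keeps an unsupported sampler name or an over-long duration from reaching the model, where the failure would be harder to interpret.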

Medium variants require Flash Attention 2 for the SAME-Large decoder path. Without flash-attn installed, Medium generation degrades to static-glitch output. Small variants do not require it.
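
Because the failure mode is silent (glitchy audio rather than an error), a pre-flight check can be useful. The sketch below assumes Flash Attention is installed as the `flash_attn` Python package and treats both medium/ and medium-base/ as Medium variants; it does not check that the installed version is 2.x. `check_variant_runnable` is a hypothetical helper.

```python
import importlib.util

# Subdirs treated as "Medium variants" for this check (an assumption).
MEDIUM_VARIANTS = {"medium", "medium-base"}


def flash_attn_available() -> bool:
    """Return True if the flash_attn package can be found in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


def check_variant_runnable(variant, has_flash_attn=None):
    """Raise early instead of letting a Medium variant produce glitch output."""
    if has_flash_attn is None:
        has_flash_attn = flash_attn_available()
    if variant in MEDIUM_VARIANTS and not has_flash_attn:
        raise RuntimeError(
            f"{variant} needs Flash Attention 2; install flash-attn or pick a Small variant"
        )
```

The `has_flash_attn` parameter exists only so the check can be exercised without depending on what is installed locally.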

Format

  • All weights are safetensors. No .pt / .ckpt / .bin in this mirror.
  • The mirror is bf16, re-saved via safetensors.torch.save_model (which preserves the shared RotaryEmbedding buffers that a bare save_file would corrupt). This halves disk size relative to the fp32 upstream. The MAESTRO runner upcasts to fp32 transiently during load_state_dict, then casts to fp16 (model_half=True) for inference, so runtime VRAM is the same as it would be with an fp32 mirror, while disk usage, I/O, and the initial safetensors-read CPU spike are all halved.
  • Approximate disk sizes per subdir: small variants ~1.14 GB each, medium variants ~4.61 GB each, SAME-S ~0.22 GB, SAME-L ~1.70 GB. Total mirror footprint ≈ 15.7 GB.
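
The total follows directly from the per-subdir figures. A small sketch of the arithmetic, using only the approximate sizes quoted above:

```python
# Approximate bf16 sizes in GB, as listed in this section.
SIZES_GB = {
    "small-music": 1.14,
    "small-sfx": 1.14,
    "small-music-base": 1.14,
    "small-sfx-base": 1.14,
    "medium": 4.61,
    "medium-base": 4.61,
    "same-s": 0.22,
    "same-l": 1.70,
}

total_gb = sum(SIZES_GB.values())
print(f"total mirror footprint: {total_gb:.1f} GB")  # prints 15.7 GB

# Downloading only what one variant needs is much smaller, e.g. a single Small model:
print(f"small-sfx only: {SIZES_GB['small-sfx']:.2f} GB")
```

Four Small subdirs (4.56 GB), two Medium subdirs (9.22 GB), and the two autoencoders (1.92 GB) add up to the stated 15.7 GB.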

Usage

Inside MAESTRO

The MAESTRO desktop app's AI > Create > Stable Audio 3 panel handles the download + variant selection. The bundled runner at backend/ai/models/stable_audio_3.py reads the per-variant subdir name from the manifest and feeds it into the vendored stable_audio_3 package at backend/ai/stable_audio_3_vendor/.

Standalone

The repo can also be consumed directly by Stability AI's upstream stable-audio-3 package:

import json

from huggingface_hub import snapshot_download
from stable_audio_3.loading_utils import load_diffusion_cond
from stable_audio_3.model import StableAudioModel

# Pull one variant (e.g. small-sfx); only files under that subdir are fetched
local = snapshot_download(
    repo_id="AEmotionStudio/stable-audio-3-mirrors",
    allow_patterns=["small-sfx/**"],
)

# Load the variant's model config
with open(f"{local}/small-sfx/model_config.json") as f:
    cfg = json.load(f)

# Load the weights in half precision on the GPU
inner = load_diffusion_cond(cfg, f"{local}/small-sfx/model.safetensors",
                            device="cuda", model_half=True)
# No LoRA adapters are used in this example
inner.use_lora = False
inner.lora_names = []
model = StableAudioModel(inner, cfg, "cuda", model_half=True)

# Generate a 10-second clip from a text prompt
audio = model.generate(
    prompt="heavy rain on a tin roof with distant thunder",
    duration=10,
    steps=8,
    cfg_scale=1.0,
)

Attribution

  • Models: Stability AI — Stable Audio 3 (blog, upstream code: Stability-AI/stable-audio-3).
  • Text encoder: Google T5-Gemma (bundled in each generative subdir).
  • Autoencoder: Stability AI SAME — Semantic-Acoustic Music Encoder.

This mirror exists to bundle the family + extras into a single browseable HF repo for the MAESTRO desktop app. Apart from the bf16 re-save described under Format, it does not modify the weights; report quality or licensing issues to the upstream repos.
