Stable Audio 3 — bundled mirror

Self-contained inference bundle for the MAESTRO desktop app. One-to-one mirror of Stability AI's Stable Audio 3 collection and the extras collection (base checkpoints + standalone autoencoders), bundled into a single browseable HF repo so the MAESTRO panel can pick the variant a user wants without juggling eight separate downloads.

License — Stability AI Community License

All weights in this repository are released by Stability AI under the Stability AI Community License:

Free for organizations with under $1M annual revenue. Commercial use of the models and outputs is permitted within that threshold; redistribution, fine-tuning, and derivative works are explicitly allowed. Outputs are yours. Above the revenue threshold, contact Stability AI for an Enterprise License.

The upstream stable-audio-3 source code is released separately under MIT.

Gated subdirs

Three subdirs mirror upstream repos that are gated on huggingface.co — you must accept Stability AI's terms (and the Gemma terms-of-use, since the text encoder is T5-Gemma) before this mirror's gating allows access:

small-music/ (mirror of stabilityai/stable-audio-3-small-music)
small-sfx/ (mirror of stabilityai/stable-audio-3-small-sfx)
medium/ (mirror of stabilityai/stable-audio-3-medium)

The base checkpoints and SAME autoencoders are open.
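
A minimal sketch of fetching a single subdir, assuming the mirror follows the standard Hugging Face access-token flow. The names `GATED_SUBDIRS`, `is_gated`, `download_args`, and `fetch_variant` are hypothetical helpers for illustration; `snapshot_download` and its `repo_id`, `allow_patterns`, and `token` parameters come from huggingface_hub.

```python
from typing import Optional

REPO_ID = "AEmotionStudio/stable-audio-3-mirrors"

# Subdirs listed above as mirrors of gated upstream repos.
GATED_SUBDIRS = {"small-music", "small-sfx", "medium"}


def is_gated(variant: str) -> bool:
    """True if the subdir needs the upstream terms accepted first."""
    return variant in GATED_SUBDIRS


def download_args(variant: str, token: Optional[str] = None) -> dict:
    """Build keyword arguments for snapshot_download for one subdir."""
    args = {"repo_id": REPO_ID, "allow_patterns": [f"{variant}/**"]}
    if token:
        # Without an explicit token, huggingface_hub falls back to a
        # locally saved login, if one exists.
        args["token"] = token
    return args


def fetch_variant(variant: str, token: Optional[str] = None) -> str:
    """Download one subdir and return the local snapshot root."""
    # Imported lazily so the helpers above work without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_args(variant, token))
```

Calling `fetch_variant("same-s")` should need no token if the open subdirs behave as described; the gated ones are expected to fail with an authorization error until the terms are accepted on the account that owns the token.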

Subdir	Role	Params	Max duration	Upstream
`small-music/`	Post-trained text → audio (music)	433 M	120 s	`stabilityai/stable-audio-3-small-music` (gated)
`small-sfx/`	Post-trained text → audio (SFX)	433 M	120 s	`stabilityai/stable-audio-3-small-sfx` (gated)
`medium/`	Post-trained text → audio (music + SFX)	1.4 B	380 s	`stabilityai/stable-audio-3-medium` (gated)
`small-music-base/`	Base ckpt for LoRA fine-tuning	433 M	120 s	`stabilityai/stable-audio-3-small-music-base`
`small-sfx-base/`	Base ckpt for LoRA fine-tuning	433 M	120 s	`stabilityai/stable-audio-3-small-sfx-base`
`medium-base/`	Base ckpt for LoRA fine-tuning	1.4 B	380 s	`stabilityai/stable-audio-3-medium-base`
`same-s/`	SAME-Small standalone autoencoder	~50 M	—	`stabilityai/SAME-S`
`same-l/`	SAME-Large standalone autoencoder	~200 M	—	`stabilityai/SAME-L`

Every subdir contains model.safetensors + model_config.json. The post-trained and base variants also include the bundled T5-Gemma text encoder and SAME pretransform; the SAME repos are autoencoder-only.
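
The table above can be expressed as a small lookup for picking a variant by requested duration. This is an illustrative sketch only: the `Variant` class and the `candidates` helper are hypothetical and are not part of MAESTRO or the upstream package. The values are copied from the table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Variant:
    subdir: str
    params: str                  # parameter count as listed in the table
    max_seconds: Optional[int]   # None for the autoencoder-only subdirs
    gated: bool


VARIANTS = [
    Variant("small-music", "433 M", 120, True),
    Variant("small-sfx", "433 M", 120, True),
    Variant("medium", "1.4 B", 380, True),
    Variant("small-music-base", "433 M", 120, False),
    Variant("small-sfx-base", "433 M", 120, False),
    Variant("medium-base", "1.4 B", 380, False),
    Variant("same-s", "~50 M", None, False),
    Variant("same-l", "~200 M", None, False),
]


def candidates(duration_s: float, include_gated: bool = True) -> List[str]:
    """Generative subdirs whose max duration covers the requested length."""
    return [
        v.subdir
        for v in VARIANTS
        if v.max_seconds is not None
        and duration_s <= v.max_seconds
        and (include_gated or not v.gated)
    ]
```

For a 200-second request only the two medium subdirs qualify, since the small variants stop at 120 s.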

Capabilities

All six generative variants share a single inference surface in MAESTRO with four modes:

Text → Audio — prompt-only generation, stereo 44.1 kHz
Audio → Audio — style transfer / restyling with an adjustable init_noise_level
Inpaint — multi-region regeneration of a source clip; non-region time is preserved verbatim
Continue — extend an existing clip past its end

Generation knobs exposed: prompt, negative prompt, duration, steps, CFG scale, APG scale, seed, batch size, sampler type (dpmpp-3m-sde / dpmpp-2m / euler / heun), distribution shift (logSNR / flux / identity), precision (fp16 / fp32), chunked decode, and a user-loadable, stackable set of LoRAs.
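
A sketch of validating the enumerated knobs before a run. The allowed values are taken from the list above; the key names (`sampler`, `distribution_shift`, `precision`, and so on) and the `check_settings` helper are assumptions for illustration and may not match MAESTRO's internal names.

```python
from typing import List

# Allowed values as listed above; the dictionary keys are hypothetical.
ALLOWED = {
    "sampler": {"dpmpp-3m-sde", "dpmpp-2m", "euler", "heun"},
    "distribution_shift": {"logSNR", "flux", "identity"},
    "precision": {"fp16", "fp32"},
}


def check_settings(settings: dict) -> List[str]:
    """Return a list of problems; an empty list means the settings pass."""
    problems = []
    for key, allowed in ALLOWED.items():
        if key in settings and settings[key] not in allowed:
            problems.append(f"{key}={settings[key]!r} not in {sorted(allowed)}")
    # Numeric knobs that only make sense as positive values.
    for key in ("duration", "steps", "batch_size"):
        if key in settings and settings[key] <= 0:
            problems.append(f"{key} must be positive")
    return problems
```

Returning a list of problems rather than raising on the first one lets a UI report every invalid field at once.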

Medium variants require Flash Attention 2 for the SAME-Large decoder path. Without flash-attn installed, Medium generation degrades to static-glitch output. Small variants do not require it.
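
One way to fail early instead of producing glitched audio is to check for the package before loading a Medium variant. This is a sketch under assumptions: `flash_attn` is the import name of the flash-attn package, while `check_variant` and the set of subdirs that need it are hypothetical and inferred from the note above.

```python
import importlib.util
from typing import Optional

# Assumption: only the two medium subdirs take the SAME-Large decoder path.
NEEDS_FLASH_ATTN = {"medium", "medium-base"}


def flash_attn_available() -> bool:
    """True if the flash_attn package can be found without importing it."""
    return importlib.util.find_spec("flash_attn") is not None


def check_variant(variant: str, have_flash_attn: Optional[bool] = None) -> None:
    """Raise before loading if a variant would run without flash-attn."""
    if have_flash_attn is None:
        have_flash_attn = flash_attn_available()
    if variant in NEEDS_FLASH_ATTN and not have_flash_attn:
        raise RuntimeError(
            f"{variant}/ needs Flash Attention 2; install flash-attn "
            "or pick a small variant"
        )
```

The check only confirms that the package is installed; it does not verify the version or that the GPU supports it.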

Format

All weights are safetensors. No .pt / .ckpt / .bin in this mirror.
Mirror is bf16 — re-saved via safetensors.torch.save_model (preserves shared RotaryEmbedding buffers that bare save_file would corrupt). Bytewise this halves disk size vs the fp32 upstream. The MAESTRO runner upcasts to fp32 transiently during load_state_dict then casts to fp16 (model_half=True) for inference — runtime VRAM is unchanged from the fp32 mirror, but disk + I/O + initial safetensors-read CPU spike are all halved.
Approximate disk sizes per subdir: small variants ~1.14 GB each, medium variants ~4.61 GB each, SAME-S ~0.22 GB, SAME-L ~1.70 GB. Total mirror footprint ≈ 15.7 GB.
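
The total can be reproduced from the per-subdir figures, which also helps estimate a partial download. The sizes below are the approximate values quoted above; `footprint_gb` is a hypothetical helper.

```python
from typing import Iterable

# Approximate bf16 sizes in GB, as quoted above.
SIZES_GB = {
    "small-music": 1.14,
    "small-sfx": 1.14,
    "small-music-base": 1.14,
    "small-sfx-base": 1.14,
    "medium": 4.61,
    "medium-base": 4.61,
    "same-s": 0.22,
    "same-l": 1.70,
}

# Storage width per parameter: bf16 is half the width of fp32.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}


def footprint_gb(subdirs: Iterable[str]) -> float:
    """Sum the approximate on-disk size of the chosen subdirs."""
    return sum(SIZES_GB[s] for s in subdirs)


total = footprint_gb(SIZES_GB)              # all eight subdirs
sfx_only = footprint_gb(["small-sfx"])      # a single-variant download
```

Four small variants, two medium variants, and the two autoencoders add up to roughly 15.7 GB, matching the stated footprint.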

Usage

Inside MAESTRO

The MAESTRO desktop app's AI > Create > Stable Audio 3 panel handles the download + variant selection. The bundled runner at backend/ai/models/stable_audio_3.py reads the per-variant subdir name from the manifest and feeds it into the vendored stable_audio_3 package at backend/ai/stable_audio_3_vendor/.

Standalone

The repo can also be consumed directly by Stability AI's upstream stable-audio-3 package:

import json

from huggingface_hub import snapshot_download
from stable_audio_3.loading_utils import load_diffusion_cond
from stable_audio_3.model import StableAudioModel

# Pull one variant (e.g. small-sfx)
local = snapshot_download(
    repo_id="AEmotionStudio/stable-audio-3-mirrors",
    allow_patterns=["small-sfx/**"],
)

with open(f"{local}/small-sfx/model_config.json") as f:
    cfg = json.load(f)

# Load the diffusion model on the GPU in half precision
inner = load_diffusion_cond(cfg, f"{local}/small-sfx/model.safetensors",
                            device="cuda", model_half=True)
# No LoRA adapters are used in this example
inner.use_lora = False
inner.lora_names = []
model = StableAudioModel(inner, cfg, "cuda", model_half=True)

audio = model.generate(
    prompt="heavy rain on a tin roof with distant thunder",
    duration=10,
    steps=8,
    cfg_scale=1.0,
)
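
To reuse the snippet above for other variants, the two per-subdir paths can be resolved and checked before loading. A minimal sketch, assuming every subdir follows the model_config.json + model.safetensors layout described earlier; `variant_files` is a hypothetical helper, not part of the upstream package.

```python
from pathlib import Path
from typing import Tuple


def variant_files(snapshot_root: str, variant: str) -> Tuple[Path, Path]:
    """Return (config path, weights path) for one subdir of the snapshot."""
    base = Path(snapshot_root) / variant
    config = base / "model_config.json"
    weights = base / "model.safetensors"
    # Fail with a clear message if the subdir was not downloaded in full.
    missing = [p.name for p in (config, weights) if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"{base} is missing: {', '.join(missing)}")
    return config, weights
```

With this helper, `variant_files(local, "small-sfx")` yields the same two paths the example builds by hand with f-strings.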

Attribution

Models: Stability AI — Stable Audio 3 (blog, upstream code: Stability-AI/stable-audio-3).
Text encoder: Google T5-Gemma (bundled in each generative subdir).
Autoencoder: Stability AI SAME — Semantic-Acoustic Music Encoder.

This mirror exists to bundle the family + extras into a single browseable HF repo for the MAESTRO desktop app. It does not modify the weights; report quality or licensing issues to the upstream repos.
