Stable Audio 3 — bundled mirror
Self-contained inference bundle for the MAESTRO desktop app. One-to-one mirror of Stability AI's Stable Audio 3 collection and the extras collection (base checkpoints + standalone autoencoders), bundled into a single browseable HF repo so the MAESTRO panel can pick the variant a user wants without juggling eight separate downloads.
License — Stability AI Community License
All weights in this repository are released by Stability AI under the Stability AI Community License:
Free for organizations with under $1M annual revenue. Commercial use of the models and outputs is permitted within that threshold; redistribution, fine-tuning, and derivative works are explicitly allowed. Outputs are yours. Above the revenue threshold, contact Stability AI for an Enterprise License.
The upstream stable-audio-3 source code is released separately under MIT.
Gated subdirs
Three subdirs mirror upstream repos that are gated on huggingface.co — you must accept Stability AI's terms (and the Gemma terms-of-use, since the text encoder is T5-Gemma) before this mirror's gating allows access:
- small-music/ (mirror of stabilityai/stable-audio-3-small-music)
- small-sfx/ (mirror of stabilityai/stable-audio-3-small-sfx)
- medium/ (mirror of stabilityai/stable-audio-3-medium)
The base checkpoints and SAME autoencoders are open.
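When scripting downloads, it can help to know up front whether a subdir is gated. The following is a minimal sketch based on the gating listed above. `GATED_SUBDIRS`, `needs_token`, and `download_kwargs` are hypothetical names for illustration and are not part of MAESTRO or huggingface_hub. The repo id is the one used in the Standalone example, and `token` and `allow_patterns` are existing `snapshot_download` parameters.

```python
from typing import Optional

# Subdirs this README lists as gated; everything else is open.
GATED_SUBDIRS = frozenset({"small-music", "small-sfx", "medium"})

# Repo id taken from the Standalone example in this README.
REPO_ID = "AEmotionStudio/stable-audio-3-mirrors"


def needs_token(subdir: str) -> bool:
    """Return True if the subdir mirrors a gated upstream repo."""
    return subdir.strip("/") in GATED_SUBDIRS


def download_kwargs(subdir: str, token: Optional[str] = None) -> dict:
    """Build keyword arguments for huggingface_hub.snapshot_download."""
    name = subdir.strip("/")
    kwargs = {"repo_id": REPO_ID, "allow_patterns": [f"{name}/**"]}
    if token:
        # An explicit token is optional: a cached login also works.
        kwargs["token"] = token
    return kwargs
```

A caller would pass the result as `snapshot_download(**download_kwargs("medium", token=...))`. For gated subdirs the download only succeeds once the account behind the token has accepted the required terms.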
Contents
| Subdir | Role | Params | Max duration | Upstream |
|---|---|---|---|---|
| small-music/ | Post-trained text → audio (music) | 433 M | 120 s | stabilityai/stable-audio-3-small-music (gated) |
| small-sfx/ | Post-trained text → audio (SFX) | 433 M | 120 s | stabilityai/stable-audio-3-small-sfx (gated) |
| medium/ | Post-trained text → audio (music + SFX) | 1.4 B | 380 s | stabilityai/stable-audio-3-medium (gated) |
| small-music-base/ | Base ckpt for LoRA fine-tuning | 433 M | 120 s | stabilityai/stable-audio-3-small-music-base |
| small-sfx-base/ | Base ckpt for LoRA fine-tuning | 433 M | 120 s | stabilityai/stable-audio-3-small-sfx-base |
| medium-base/ | Base ckpt for LoRA fine-tuning | 1.4 B | 380 s | stabilityai/stable-audio-3-medium-base |
| same-s/ | SAME-Small standalone autoencoder | ~50 M | n/a | stabilityai/SAME-S |
| same-l/ | SAME-Large standalone autoencoder | ~200 M | n/a | stabilityai/SAME-L |
Every subdir contains model.safetensors and model_config.json. The post-trained and base variants also include the bundled T5-Gemma text encoder and SAME pretransform; the SAME subdirs are autoencoder-only.
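After a download, a quick layout check can confirm that a subdir has the two files named above. This is a sketch only: `REQUIRED_FILES` and `missing_files` are hypothetical helpers, and `root` is assumed to be the local snapshot directory.

```python
from pathlib import Path

# The two files this README says every subdir contains.
REQUIRED_FILES = ("model.safetensors", "model_config.json")


def missing_files(root, subdir):
    """Return the required file names that are absent from root/subdir."""
    base = Path(root) / subdir
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]
```

An empty list means the subdir looks complete; a non-empty list names what still needs to be fetched.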
Capabilities
All six generative variants share a single inference surface in MAESTRO with four modes:
- Text → Audio: prompt-only generation, stereo 44.1 kHz
- Audio → Audio: style transfer / restyling with an adjustable init_noise_level
- Inpaint: multi-region regeneration of a source clip; non-region time is preserved verbatim
- Continue: extend an existing clip past its end
Generation knobs exposed: prompt, negative prompt, duration, steps, CFG scale, APG scale, seed, batch size, sampler type (dpmpp-3m-sde / dpmpp-2m / euler / heun), distribution shift (logSNR / flux / identity), precision (fp16 / fp32), chunked decode, and a user-loadable stack of LoRAs.
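The enumerated choices above lend themselves to a simple validation step before a request reaches the runner. The sketch below is illustrative and is not MAESTRO's actual settings object. `GenerationSettings` and its field names are hypothetical, it covers only a subset of the knobs, and the defaults are placeholders (duration, steps, and CFG scale are borrowed from the Standalone example).

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values as listed in this README.
SAMPLERS = ("dpmpp-3m-sde", "dpmpp-2m", "euler", "heun")
SHIFTS = ("logSNR", "flux", "identity")
PRECISIONS = ("fp16", "fp32")


@dataclass
class GenerationSettings:
    """Hypothetical container for a subset of the knobs; defaults are placeholders."""
    prompt: str
    negative_prompt: str = ""
    duration: float = 10.0       # seconds
    steps: int = 8
    cfg_scale: float = 1.0
    seed: Optional[int] = None   # None means "caller picks a seed" in this sketch
    batch_size: int = 1
    sampler: str = "dpmpp-3m-sde"
    shift: str = "logSNR"
    precision: str = "fp16"

    def validate(self, max_duration: float) -> None:
        """Raise ValueError if a field falls outside the documented choices."""
        if self.sampler not in SAMPLERS:
            raise ValueError(f"unknown sampler: {self.sampler!r}")
        if self.shift not in SHIFTS:
            raise ValueError(f"unknown distribution shift: {self.shift!r}")
        if self.precision not in PRECISIONS:
            raise ValueError(f"unknown precision: {self.precision!r}")
        if not 0 < self.duration <= max_duration:
            raise ValueError(f"duration must be in (0, {max_duration}] seconds")
        if self.steps < 1 or self.batch_size < 1:
            raise ValueError("steps and batch_size must be at least 1")
```

`max_duration` would come from the Contents table: 120 for the small variants and 380 for the medium ones.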
Medium variants require Flash Attention 2 for the SAME-Large decoder path. Without flash-attn installed, Medium generation degrades to static-glitch output. Small variants do not require it.
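Because the failure mode is degraded audio rather than an error, a preflight check before loading a Medium variant can save time. This is a minimal sketch, assuming the package is importable as `flash_attn` (the module name of the flash-attn distribution). `MEDIUM_SUBDIRS`, `flash_attn_available`, and `preflight` are hypothetical names, and treating medium-base the same as medium is an assumption drawn from the phrase "Medium variants".

```python
import importlib.util
from typing import Optional

# Assumption: both medium subdirs count as "Medium variants".
MEDIUM_SUBDIRS = frozenset({"medium", "medium-base"})


def flash_attn_available() -> bool:
    """Check whether the flash_attn module can be found, without importing it."""
    return importlib.util.find_spec("flash_attn") is not None


def preflight(subdir: str, has_flash_attn: Optional[bool] = None) -> None:
    """Raise RuntimeError if a Medium variant is requested without flash-attn."""
    if has_flash_attn is None:
        has_flash_attn = flash_attn_available()
    if subdir.strip("/") in MEDIUM_SUBDIRS and not has_flash_attn:
        raise RuntimeError(
            f"{subdir} needs flash-attn installed; without it output degrades to static"
        )
```

Note that `find_spec` only confirms the package is installed. It does not verify that the build matches the local GPU or CUDA version.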
Format
- All weights are safetensors. No .pt / .ckpt / .bin in this mirror.
- Mirror is bf16, re-saved via safetensors.torch.save_model (which preserves shared RotaryEmbedding buffers that a bare save_file would corrupt). Bytewise this halves disk size vs the fp32 upstream. The MAESTRO runner upcasts to fp32 transiently during load_state_dict, then casts to fp16 (model_half=True) for inference, so runtime VRAM is unchanged from the fp32 mirror, but disk, I/O, and the initial safetensors-read CPU spike are all halved.
- Approximate disk sizes per subdir: small variants ~1.14 GB each, medium variants ~4.61 GB each, SAME-S ~0.22 GB, SAME-L ~1.70 GB. Total mirror footprint ≈ 15.7 GB.
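The per-subdir sizes above can be summed to estimate the disk space needed for any selection of variants. The sketch below uses only the approximate figures from this README; `SIZES_GB` and `footprint_gb` are hypothetical names.

```python
# Approximate on-disk size per subdir in GB, as listed in this README.
SIZES_GB = {
    "small-music": 1.14,
    "small-sfx": 1.14,
    "small-music-base": 1.14,
    "small-sfx-base": 1.14,
    "medium": 4.61,
    "medium-base": 4.61,
    "same-s": 0.22,
    "same-l": 1.70,
}


def footprint_gb(subdirs):
    """Sum the approximate sizes of the chosen subdirs, rounded to 2 decimals."""
    return round(sum(SIZES_GB[name] for name in subdirs), 2)
```

Passing all eight subdirs gives 15.7 GB, which matches the stated total: 4 × 1.14 + 2 × 4.61 + 0.22 + 1.70.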
Usage
Inside MAESTRO
The MAESTRO desktop app's AI > Create > Stable Audio 3 panel handles the download + variant selection. The bundled runner at backend/ai/models/stable_audio_3.py reads the per-variant subdir name from the manifest and feeds it into the vendored stable_audio_3 package at backend/ai/stable_audio_3_vendor/.
Standalone
The repo can also be consumed directly by Stability AI's upstream stable-audio-3 package:
import json

from huggingface_hub import snapshot_download
from stable_audio_3.loading_utils import load_diffusion_cond
from stable_audio_3.model import StableAudioModel

# Pull one variant (e.g. small-sfx)
local = snapshot_download(
    repo_id="AEmotionStudio/stable-audio-3-mirrors",
    allow_patterns=["small-sfx/**"],
)

# Load the variant's config from its subdir
with open(f"{local}/small-sfx/model_config.json") as f:
    cfg = json.load(f)

inner = load_diffusion_cond(
    cfg,
    f"{local}/small-sfx/model.safetensors",
    device="cuda",
    model_half=True,
)
# No LoRA adapters are loaded in this example
inner.use_lora = False
inner.lora_names = []

model = StableAudioModel(inner, cfg, "cuda", model_half=True)
audio = model.generate(
    prompt="heavy rain on a tin roof with distant thunder",
    duration=10,
    steps=8,
    cfg_scale=1.0,
)
Attribution
- Models: Stability AI, Stable Audio 3 (blog, upstream code: Stability-AI/stable-audio-3).
- Text encoder: Google T5-Gemma (bundled in each generative subdir).
- Autoencoder: Stability AI SAME (Semantic-Acoustic Music Encoder).
This mirror exists to bundle the family + extras into a single browseable HF repo for the MAESTRO desktop app. It does not modify the weights; report quality or licensing issues to the upstream repos.