---
license: other
license_name: stability-ai-community-license
license_link: https://stability.ai/license
library_name: stable-audio-3
tags:
- audio
- audio-generation
- text-to-audio
- audio-to-audio
- inpainting
- stable-audio-3
- stability-ai
- safetensors
pipeline_tag: text-to-audio
---

# Stable Audio 3: bundled mirror

Self-contained inference bundle for the [MAESTRO](https://github.com/AEmotionStudio/MAESTRO) desktop app.

One-to-one mirror of Stability AI's [Stable Audio 3 collection](https://huggingface.co/collections/stabilityai/stable-audio-3) and the [extras collection](https://huggingface.co/collections/stabilityai/stable-audio-3-extra) (base checkpoints + standalone autoencoders), bundled into a single browseable HF repo so the MAESTRO panel can pick the variant a user wants without juggling eight separate downloads.

## License: Stability AI Community License

All weights in this repository are released by Stability AI under the **[Stability AI Community License](https://stability.ai/license)**:

> Free for organizations with **under $1M annual revenue**. Commercial use of the models and outputs is permitted within that threshold; redistribution, fine-tuning, and derivative works are explicitly allowed. **Outputs are yours.** Above the revenue threshold, contact Stability AI for an Enterprise License.

The upstream [`stable-audio-3` source code](https://github.com/Stability-AI/stable-audio-3) is released separately under **MIT**.

### Gated subdirs

Three subdirs mirror upstream repos that are **gated** on huggingface.co. You must accept Stability AI's terms (and the Gemma terms of use, since the text encoder is T5-Gemma) before this mirror's gating allows access:

- `small-music/` (mirror of [`stabilityai/stable-audio-3-small-music`](https://huggingface.co/stabilityai/stable-audio-3-small-music))
- `small-sfx/` (mirror of [`stabilityai/stable-audio-3-small-sfx`](https://huggingface.co/stabilityai/stable-audio-3-small-sfx))
- `medium/` (mirror of [`stabilityai/stable-audio-3-medium`](https://huggingface.co/stabilityai/stable-audio-3-medium))

The base checkpoints and SAME autoencoders are open.

## Contents

| Subdir | Role | Params | Max duration | Upstream |
|---|---|---|---|---|
| `small-music/` | Post-trained text → audio (music) | 433 M | 120 s | `stabilityai/stable-audio-3-small-music` *(gated)* |
| `small-sfx/` | Post-trained text → audio (SFX) | 433 M | 120 s | `stabilityai/stable-audio-3-small-sfx` *(gated)* |
| `medium/` | Post-trained text → audio (music + SFX) | 1.4 B | 380 s | `stabilityai/stable-audio-3-medium` *(gated)* |
| `small-music-base/` | Base checkpoint for LoRA fine-tuning | 433 M | 120 s | `stabilityai/stable-audio-3-small-music-base` |
| `small-sfx-base/` | Base checkpoint for LoRA fine-tuning | 433 M | 120 s | `stabilityai/stable-audio-3-small-sfx-base` |
| `medium-base/` | Base checkpoint for LoRA fine-tuning | 1.4 B | 380 s | `stabilityai/stable-audio-3-medium-base` |
| `same-s/` | SAME-Small standalone autoencoder | ~50 M | n/a | `stabilityai/SAME-S` |
| `same-l/` | SAME-Large standalone autoencoder | ~200 M | n/a | `stabilityai/SAME-L` |

Every subdir contains `model.safetensors` and `model_config.json`. The post-trained and base variants also bundle the T5-Gemma text encoder and the SAME pretransform; the SAME subdirs are autoencoder-only.
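
For scripting against the mirror, the table above can be captured as a small lookup. This is only an illustrative sketch: `Variant`, `VARIANTS`, `allow_patterns_for`, and `check_duration` are hypothetical names that belong to neither MAESTRO nor the upstream package; the values are copied from the table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Variant:
    upstream: str                  # upstream repo id on huggingface.co
    gated: bool                    # True if terms must be accepted first
    max_duration_s: Optional[int]  # None for the autoencoder-only subdirs


# Values copied from the Contents table; the structure itself is hypothetical.
VARIANTS = {
    "small-music": Variant("stabilityai/stable-audio-3-small-music", True, 120),
    "small-sfx": Variant("stabilityai/stable-audio-3-small-sfx", True, 120),
    "medium": Variant("stabilityai/stable-audio-3-medium", True, 380),
    "small-music-base": Variant("stabilityai/stable-audio-3-small-music-base", False, 120),
    "small-sfx-base": Variant("stabilityai/stable-audio-3-small-sfx-base", False, 120),
    "medium-base": Variant("stabilityai/stable-audio-3-medium-base", False, 380),
    "same-s": Variant("stabilityai/SAME-S", False, None),
    "same-l": Variant("stabilityai/SAME-L", False, None),
}


def allow_patterns_for(subdir: str) -> List[str]:
    """Return the glob that restricts a snapshot download to one subdir."""
    if subdir not in VARIANTS:
        raise KeyError(f"unknown subdir: {subdir!r}")
    return [f"{subdir}/**"]


def check_duration(subdir: str, duration_s: float) -> None:
    """Raise ValueError if the requested duration exceeds the variant's limit."""
    limit = VARIANTS[subdir].max_duration_s
    if limit is None:
        raise ValueError(f"{subdir} is autoencoder-only and does not generate audio")
    if not 0 < duration_s <= limit:
        raise ValueError(f"{subdir} supports up to {limit} s, got {duration_s} s")
```

The list returned by `allow_patterns_for("small-sfx")` can be passed as the `allow_patterns` argument of `huggingface_hub.snapshot_download`, as in the Standalone example.
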

## Capabilities

All six generative variants share a single inference surface in MAESTRO with four modes:

- **Text → Audio**: prompt-only generation, stereo 44.1 kHz
- **Audio → Audio**: style transfer / restyling with an adjustable `init_noise_level`
- **Inpaint**: multi-region regeneration of a source clip; audio outside the selected regions is preserved verbatim
- **Continue**: extend an existing clip past its end
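
To make the inpaint contract concrete, the sketch below shows one way "only the selected regions change" could be expressed. It is not MAESTRO's implementation (a real pipeline would more likely mask in latent space): `regions_to_ranges` and `composite` are hypothetical helpers, and audio is modelled as a flat list of samples for simplicity.

```python
def regions_to_ranges(regions, duration_s, sample_rate=44100):
    """Convert (start_s, end_s) regions to merged, clamped sample-index ranges."""
    total = int(round(duration_s * sample_rate))
    spans = []
    for start_s, end_s in regions:
        lo = max(0, int(round(start_s * sample_rate)))
        hi = min(total, int(round(end_s * sample_rate)))
        if lo < hi:  # drop empty or fully out-of-bounds regions
            spans.append((lo, hi))
    spans.sort()
    merged = []
    for lo, hi in spans:
        if merged and lo <= merged[-1][1]:
            # Overlapping or touching regions collapse into one range.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def composite(source, generated, ranges):
    """Take `generated` samples inside the ranges and `source` samples elsewhere."""
    if len(source) != len(generated):
        raise ValueError("source and generated must have the same length")
    out = list(source)
    for lo, hi in ranges:
        out[lo:hi] = generated[lo:hi]
    return out
```

Every sample outside the merged ranges is copied from `source` untouched, which is the property the Inpaint mode promises.
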

Generation knobs exposed: prompt, negative prompt, duration, steps, CFG scale, APG scale, seed, batch size, sampler type (`dpmpp-3m-sde` / `dpmpp-2m` / `euler` / `heun`), distribution shift (`logSNR` / `flux` / `identity`), precision (fp16 / fp32), chunked decode, and a user-loadable stack of LoRAs.
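
These knobs map naturally onto a validated settings object. The following is a hypothetical sketch, not MAESTRO's actual schema: the class name, field names, and default values are all assumptions (the defaults for duration, steps, and CFG scale simply echo the Standalone example), while the allowed sampler, shift, and precision values are the ones listed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SAMPLERS = ("dpmpp-3m-sde", "dpmpp-2m", "euler", "heun")
SHIFTS = ("logSNR", "flux", "identity")
PRECISIONS = ("fp16", "fp32")


@dataclass
class GenerationSettings:
    """Hypothetical container for the generation knobs; defaults are assumed."""
    prompt: str
    negative_prompt: str = ""
    duration: float = 10.0          # seconds
    steps: int = 8
    cfg_scale: float = 1.0
    apg_scale: float = 1.0
    seed: Optional[int] = None      # None means "pick a random seed"
    batch_size: int = 1
    sampler: str = "dpmpp-3m-sde"
    shift: str = "logSNR"
    precision: str = "fp16"
    chunked_decode: bool = False
    loras: List[str] = field(default_factory=list)  # stacked LoRA file paths

    def __post_init__(self) -> None:
        # Reject values outside the documented choices before any GPU work starts.
        if self.sampler not in SAMPLERS:
            raise ValueError(f"sampler must be one of {SAMPLERS}, got {self.sampler!r}")
        if self.shift not in SHIFTS:
            raise ValueError(f"shift must be one of {SHIFTS}, got {self.shift!r}")
        if self.precision not in PRECISIONS:
            raise ValueError(f"precision must be one of {PRECISIONS}, got {self.precision!r}")
        if self.duration <= 0 or self.steps < 1 or self.batch_size < 1:
            raise ValueError("duration must be > 0; steps and batch_size must be >= 1")
```
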

> **Medium variants** require **[Flash Attention 2](https://github.com/Dao-AILab/flash-attention)** for the SAME-Large decoder path. Without `flash-attn` installed, Medium generation degrades to static-glitch output. Small variants do not require it.
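
Because the failure mode is silent (glitchy audio rather than an error), a preflight check before loading a Medium variant may be worthwhile. This is a minimal sketch: both function names are hypothetical, and the check only confirms that the `flash_attn` module is importable, not that its version or the GPU is suitable.

```python
import importlib.util


def flash_attn_available() -> bool:
    """Return True if the `flash_attn` module can be found in this environment."""
    return importlib.util.find_spec("flash_attn") is not None


def require_flash_attn_for(subdir: str) -> None:
    """Fail early for Medium variants when flash-attn is missing."""
    # Assumption: the Medium subdirs are exactly those whose name starts with "medium".
    needs_flash = subdir.rstrip("/").startswith("medium")
    if needs_flash and not flash_attn_available():
        raise RuntimeError(
            f"{subdir} needs Flash Attention 2; install the flash-attn package first"
        )
```
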

## Format

- **All weights are `safetensors`.** No `.pt` / `.ckpt` / `.bin` in this mirror.
- The mirror is stored in **bf16**, re-saved via `safetensors.torch.save_model` (which preserves the shared RotaryEmbedding buffers that a bare `save_file` would corrupt). This halves the on-disk size relative to the fp32 upstream. The MAESTRO runner upcasts to fp32 transiently during `load_state_dict`, then casts to fp16 (`model_half=True`) for inference, so runtime VRAM is unchanged from an fp32 mirror, while disk usage, I/O, and the initial safetensors-read CPU spike are all halved.
- Approximate disk sizes per subdir: small variants ~1.14 GB each, medium variants ~4.61 GB each, SAME-S ~0.22 GB, SAME-L ~1.70 GB. Total mirror footprint ≈ 15.7 GB.
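
The total follows from the eight subdirs (four small, two medium, two SAME). The snippet below reproduces that arithmetic and estimates the download size for a chosen subset; `SIZES_GB` and `footprint_gb` are illustrative names, and the figures are the approximate ones quoted above.

```python
# Approximate per-subdir sizes in GB, as quoted in the list above.
SIZES_GB = {
    "small-music": 1.14,
    "small-sfx": 1.14,
    "small-music-base": 1.14,
    "small-sfx-base": 1.14,
    "medium": 4.61,
    "medium-base": 4.61,
    "same-s": 0.22,
    "same-l": 1.70,
}


def footprint_gb(subdirs=None):
    """Sum the approximate sizes of the given subdirs (all of them by default)."""
    names = SIZES_GB if subdirs is None else subdirs
    return sum(SIZES_GB[name] for name in names)


print(f"full mirror: {footprint_gb():.1f} GB")                # full mirror: 15.7 GB
print(f"small-sfx only: {footprint_gb(['small-sfx']):.2f} GB")  # small-sfx only: 1.14 GB
```
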

## Usage

### Inside MAESTRO

The MAESTRO desktop app's `AI > Create > Stable Audio 3` panel handles the download and variant selection. The bundled runner at `backend/ai/models/stable_audio_3.py` reads the per-variant subdir name from the manifest and passes it to the vendored `stable_audio_3` package at `backend/ai/stable_audio_3_vendor/`.

### Standalone

The repo can also be consumed directly by Stability AI's upstream [`stable-audio-3` package](https://github.com/Stability-AI/stable-audio-3):

```python
import json
import os

from huggingface_hub import snapshot_download
from stable_audio_3.loading_utils import load_diffusion_cond
from stable_audio_3.model import StableAudioModel

# Pull one variant (e.g. small-sfx). This subdir is gated, so accept the
# terms and authenticate with Hugging Face before downloading.
local = snapshot_download(
    repo_id="AEmotionStudio/stable-audio-3-mirrors",
    allow_patterns=["small-sfx/**"],
)
variant_dir = os.path.join(local, "small-sfx")

# Load the per-variant model config.
with open(os.path.join(variant_dir, "model_config.json")) as f:
    cfg = json.load(f)

# Build the conditioned diffusion model from the safetensors weights (fp16).
inner = load_diffusion_cond(
    cfg,
    os.path.join(variant_dir, "model.safetensors"),
    device="cuda",
    model_half=True,
)
# No LoRA adapters for this run.
inner.use_lora = False
inner.lora_names = []

model = StableAudioModel(inner, cfg, "cuda", model_half=True)
audio = model.generate(
    prompt="heavy rain on a tin roof with distant thunder",
    duration=10,
    steps=8,
    cfg_scale=1.0,
)
```

## Attribution

- **Models:** Stability AI, *Stable Audio 3* ([blog](https://stability.ai/news/stable-audio-3-open), upstream code: [`Stability-AI/stable-audio-3`](https://github.com/Stability-AI/stable-audio-3)).
- **Text encoder:** Google T5-Gemma (bundled in each generative subdir).
- **Autoencoder:** Stability AI SAME (*Semantic-Acoustic Music Encoder*).

This mirror exists to bundle the family + extras into a single browseable HF repo for the MAESTRO desktop app. Apart from the bf16 re-save described under Format, it does not modify the weights; report quality or licensing issues to the upstream repos.