| --- |
| license: other |
| license_name: stability-ai-community |
| license_link: https://huggingface.co/stabilityai/stable-audio-open-small/blob/main/LICENSE.md |
| tags: |
| - core-ai |
| - apple |
| - on-device |
| - text-to-audio |
| - music-generation |
| - stable-audio |
| - diffusion |
| base_model: |
| - stabilityai/stable-audio-open-small |
| pipeline_tag: text-to-audio |
| --- |
| |
| > **Mirror** of [`mlboydaisuke/Stable-Audio-Open-Small-CoreAI`](https://huggingface.co/mlboydaisuke/Stable-Audio-Open-Small-CoreAI), the canonical repo ([CoreAI Model Zoo](https://github.com/john-rocky/coreai-model-zoo)). Updates land there first. |
|
|
|
|
| # Stable Audio Open Small: Core AI (on-device music generation)
|
|
| **The model zoo's first MUSIC / AUDIO generation model for Apple Core AI.** Type a prompt, get ~11 s
| of 44.1 kHz stereo audio, generated entirely **on-device** on Apple Silicon. A community port of
| [`stabilityai/stable-audio-open-small`](https://huggingface.co/stabilityai/stable-audio-open-small) |
| (Stability AI + Arm) to Core AI. |
|
|
| A latent **diffusion** text-to-audio model: a T5 text encoder conditions a DiT (diffusion transformer) |
| that denoises a latent over **8 rectified-flow steps**, then an Oobleck VAE decodes the latent to a |
| waveform. Distilled (ARC) for few-step generation, so it's fast. |
|
|
| <!-- gen-cards:use-it begin id=stable-audio-open-small (managed by scripts/gen-cards - edit cards.json / QuickStart.swift, not this block) -->
|  |
| *Stable Audio Open Small on iPhone 17 Pro: the zoo's coreai-audio app, 12 s of audio in ~1 s.*
|
|
| ## Use it |
|
|
| ▶️ **Run it (source)**: the [Music runner](https://github.com/john-rocky/coreai-kit/tree/main/Examples/Music)
| (GUI + CLI, one app for every text-to-music model in the catalog): |
|
|
| ```bash |
| git clone https://github.com/john-rocky/coreai-kit |
| open coreai-kit/Examples/Music/Music.xcodeproj |
| # Run the app, then pick "Stable Audio Open Small" in the model picker
| |
| # agents / headless (macOS): |
| cd coreai-kit/Examples/Music |
| swift run music-cli --model stable-audio-open-small --prompt "128 BPM tech house drum loop" --output loop.wav |
| ``` |
|
|
| 💻 **Build with it**: a complete snippet; the glue is kit API, so it runs as pasted:
|
|
| ```swift |
| import CoreAIKit |
| |
| let prompt = "128 BPM tech house drum loop"  // any text prompt
| let musician = try await KitMusician(catalog: "stable-audio-open-small")
| let audio = try await musician.generate(prompt)
| // audio.samples: 44.1 kHz stereo (planar L/R); play it or write a WAV
| ``` |
|
|
| The take-home is [`Examples/Music/Sources/QuickStart.swift`](https://github.com/john-rocky/coreai-kit/blob/main/Examples/Music/Sources/QuickStart.swift):
| this exact code as one typed function, no UI. The CLI is an argument shell over it, and
| the GUI drives the same `KitMusician(catalog:)` and plays the result.
| To set the length, call `generate(_:seconds:)` with up to the model's ~11 s window. The WAV container is your
| app's territory (the runner ships a 30-line writer with planar-stereo support).
|
|
| **Integration checklist** |
|
|
| - SPM: `https://github.com/john-rocky/coreai-kit`, product **CoreAIKit**
| - Info.plist: none needed |
| - Entitlements: none needed (macOS) |
| - First run downloads the model (1.1 GB on Mac), then it loads from the
| local cache (Application Support; progress via the `downloadProgress` callback)
| - Measure in Release: Debug is ~3× slower on per-token host work
| <!-- gen-cards:use-it end --> |
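
The "Use it" notes leave the WAV container to the app. As a rough illustration only (not the runner's actual writer), the sketch below packs planar stereo floats into a 16-bit PCM WAV image. It assumes the samples arrive as one `[Float]` with the whole left channel followed by the whole right channel; the real layout of `audio.samples` may differ, and `makeWAV` is a hypothetical helper name.

```swift
import Foundation

/// Packs planar stereo floats (all left samples, then all right samples)
/// into a 16-bit PCM WAV file image. The planar layout is an assumption.
func makeWAV(planar: [Float], sampleRate: Int = 44_100) -> Data {
    let frames = planar.count / 2
    let channels = 2
    let bytesPerSample = 2
    let dataSize = frames * channels * bytesPerSample

    var wav = Data()
    // Appends any fixed-width integer in little-endian byte order.
    func append<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { wav.append(contentsOf: $0) }
    }

    wav.append(contentsOf: Array("RIFF".utf8))
    append(UInt32(36 + dataSize))                           // RIFF chunk size
    wav.append(contentsOf: Array("WAVE".utf8))
    wav.append(contentsOf: Array("fmt ".utf8))
    append(UInt32(16))                                      // fmt chunk size
    append(UInt16(1))                                       // format tag: PCM
    append(UInt16(channels))
    append(UInt32(sampleRate))
    append(UInt32(sampleRate * channels * bytesPerSample))  // byte rate
    append(UInt16(channels * bytesPerSample))               // block align
    append(UInt16(16))                                      // bits per sample
    wav.append(contentsOf: Array("data".utf8))
    append(UInt32(dataSize))

    // Interleave L/R per frame, clamping to [-1, 1] before scaling to Int16.
    for i in 0..<frames {
        for sample in [planar[i], planar[frames + i]] {
            let clamped = max(-1, min(1, sample))
            append(Int16((clamped * 32767).rounded()))
        }
    }
    return wav
}
```

With a layout like that, `try makeWAV(planar: samples).write(to: URL(fileURLWithPath: "loop.wav"))` would produce a playable file; a float32 WAV would avoid the quantization step at the cost of a larger file.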
|
|
| ## What's in the bundle (`macos/`) |
|
|
| Three Core AI `.aimodel` bundles + a tiny host sampler loop: |
|
|
| | bundle | role | I/O | |
| |---|---|---| |
| | `sa_cond_fp16b` | T5-base encoder + number conditioner | `input_ids[1,64], attention_mask[1,64], seconds_norm[1] → cross_attn_cond[1,65,768], global_embed[1,768], cond_mask[1,65]` |
| | `sa_dit_fp16` | diffusion transformer (run 8×) | `x[1,64,256], t[1], cross_attn_cond, global_embed, cross_attn_cond_mask → v[1,64,256]` |
| | `sa_vae_fp16` | Oobleck VAE decoder | `latent[1,64,256] → audio[1,2,524288]` |
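
The conditioner takes fixed-shape `input_ids[1,64]` and `attention_mask[1,64]`, so the host has to pad or truncate the tokenized prompt. A minimal sketch of that step follows, under assumptions that are not confirmed by the bundle: `padTokens` is a hypothetical helper, the element type is taken to be `Int32`, and the pad id is 0 (the usual T5 pad token).

```swift
/// Pads or truncates token ids to a fixed length and builds the matching mask.
/// padID 0 is the usual T5 pad token; treat it as an assumption for this bundle.
func padTokens(_ ids: [Int32], length: Int = 64, padID: Int32 = 0)
    -> (inputIDs: [Int32], attentionMask: [Int32]) {
    let kept = Array(ids.prefix(length))          // truncate overly long prompts
    let padding = length - kept.count
    let inputIDs = kept + [Int32](repeating: padID, count: padding)
    // 1 marks a real token, 0 marks padding.
    let mask = [Int32](repeating: 1, count: kept.count)
        + [Int32](repeating: 0, count: padding)
    return (inputIDs, mask)
}
```

Plain truncation can drop the end-of-sequence token on very long prompts; whether that matters here depends on how the runner tokenizes, which this sketch does not model.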
|
|
| **Host loop** (`StableAudioRunner`): tokenize (T5, `t5_tokenizer/`) → conditioner → start from Gaussian
| noise → 8-step rectified-flow Euler `x = x + (t_next − t)·v` over the fixed schedule
| `[1.0, .9944, .9845, .9579, .8909, .7455, .5125, .2739] → 0` → VAE decode → 44.1 kHz stereo WAV.
| No KV cache, no CFG (cfg_scale 1.0; the model is ARC-distilled).
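
The sampler part of the host loop is small enough to sketch. The code below is an illustration of the Euler update over the fixed schedule, not the `StableAudioRunner` source: `sampleEuler` is a hypothetical name, and the `velocity` closure stands in for one `sa_dit_fp16` call with the conditioning tensors already bound.

```swift
/// Fixed timestep schedule from the host loop, with the final target t = 0 appended.
let schedule: [Float] = [1.0, 0.9944, 0.9845, 0.9579, 0.8909, 0.7455, 0.5125, 0.2739, 0.0]

/// Runs the Euler updates x = x + (t_next - t) * v over the schedule.
/// `velocity` is a hypothetical stand-in for one DiT forward pass.
func sampleEuler(noise: [Float], velocity: ([Float], Float) -> [Float]) -> [Float] {
    var x = noise
    // Nine schedule entries give eight (t, t_next) pairs, i.e. eight DiT calls.
    for (t, tNext) in zip(schedule, schedule.dropFirst()) {
        let v = velocity(x, t)
        let dt = tNext - t            // negative: time runs from 1 down to 0
        for i in x.indices { x[i] += dt * v[i] }
    }
    return x
}
```

In a real run `noise` would hold the flattened `[1,64,256]` Gaussian latent, and the returned latent would go to the VAE decoder.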
| |
| ## Performance (M4 Max, GPU) |
| |
| | metric | value | |
| |---|---| |
| | 8-step DiT | ~200 ms (25 ms/step) | |
| | VAE decode | ~185 ms | |
| | **total** | **~0.4 s for ~11.9 s of audio (~30× real-time)** |
| | size | fp16, ~1.0 GB (DiT 651 MB + cond 210 MB + VAE 149 MB) |
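
The figures in the table can be cross-checked with a little arithmetic. The sketch below uses only numbers from this card (the VAE output length, the sample rate, and the two timings); the variable names are illustrative, and the conditioner and host overhead are left out because the table does not list them.

```swift
import Foundation

let sampleRate = 44_100.0
let samplesPerChannel = 524_288.0                   // VAE output length per channel
let audioSeconds = samplesPerChannel / sampleRate   // clip duration in seconds
let computeSeconds = 8 * 0.025 + 0.185              // eight DiT steps + VAE decode
let realTimeFactor = audioSeconds / computeSeconds  // seconds of audio per second of compute

print(String(format: "%.2f s of audio, %.0fx real-time", audioSeconds, realTimeFactor))
```

This gives roughly 11.9 s of audio and a factor of about 31 from DiT and VAE time alone, which is consistent with the ~30× in the table once the remaining overhead is included.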
| |
| Numerics: each bundle is engine-gated against the reference at cosine similarity ≥ 0.9999; the full pipeline reproduces the
| reference audio exactly.
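
A cosine-similarity gate of this kind can be sketched in a few lines. This is an assumption about the shape of such a check, not the zoo's actual gating code: `cosineSimilarity` and `passesGate` are hypothetical names, and both tensors are taken as flattened `[Float]` arrays of equal length.

```swift
/// Cosine similarity between two flattened tensors, accumulated in Double.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Double {
    precondition(a.count == b.count, "tensors must have the same element count")
    var dot = 0.0, normA = 0.0, normB = 0.0
    for (x, y) in zip(a, b) {
        dot += Double(x) * Double(y)
        normA += Double(x) * Double(x)
        normB += Double(y) * Double(y)
    }
    return dot / (normA.squareRoot() * normB.squareRoot())
}

/// True when the converted bundle's output stays within the gate threshold.
func passesGate(_ output: [Float], reference: [Float], threshold: Double = 0.9999) -> Bool {
    cosineSimilarity(output, reference) >= threshold
}
```

Accumulating in `Double` keeps the comparison itself from adding fp32 rounding noise on large tensors; an all-zero tensor would divide by zero and needs separate handling.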
| |
| ## Roadmap |
| |
| - iPhone (h18p) build: bundles AOT-compile; on-device RTF (real-time factor) pending
| - int8 (further size cut) |
| - a music-generation tab in the zoo app |
| |
| ## Credits & license |
| |
| A community **Core AI conversion**: all credit to **Stability AI** (and Arm) for
| [Stable Audio Open Small](https://huggingface.co/stabilityai/stable-audio-open-small); T5 text encoder |
| by Google. This bundle is governed by the **[Stability AI Community License](https://huggingface.co/stabilityai/stable-audio-open-small/blob/main/LICENSE.md)** |
| (free for non-commercial use and for commercial use under \$1M annual revenue; review the license |
| before use). No retraining: conversion only.
| |
| Part of the [Core AI model zoo](https://github.com/john-rocky/coreai-model-zoo). |
| |