---
license: other
license_name: stability-ai-community
license_link: https://huggingface.co/stabilityai/stable-audio-open-small/blob/main/LICENSE.md
tags:
- core-ai
- apple
- on-device
- text-to-audio
- music-generation
- stable-audio
- diffusion
base_model:
- stabilityai/stable-audio-open-small
pipeline_tag: text-to-audio
---
> **Mirror** of [`mlboydaisuke/Stable-Audio-Open-Small-CoreAI`](https://huggingface.co/mlboydaisuke/Stable-Audio-Open-Small-CoreAI), the canonical repo ([CoreAI Model Zoo](https://github.com/john-rocky/coreai-model-zoo)). Updates land there first.
# Stable Audio Open Small: Core AI (on-device music generation)
**The model zoo's first MUSIC / AUDIO generation model for Apple Core AI.** Type a prompt, get ~11 s
of 44.1 kHz stereo audio, generated entirely **on-device** on Apple Silicon. A community port of
[`stabilityai/stable-audio-open-small`](https://huggingface.co/stabilityai/stable-audio-open-small)
(Stability AI + Arm) to Core AI.
A latent **diffusion** text-to-audio model: a T5 text encoder conditions a DiT (diffusion transformer)
that denoises a latent over **8 rectified-flow steps**, then an Oobleck VAE decodes the latent to a
waveform. Distilled (ARC) for few-step generation, so it's fast.
<!-- gen-cards:use-it begin id=stable-audio-open-small (managed by scripts/gen-cards; edit cards.json / QuickStart.swift, not this block) -->
![Stable Audio Open Small demo](https://huggingface.co/mlboydaisuke/Stable-Audio-Open-Small-CoreAI/resolve/main/demo.gif)
*Stable Audio Open Small on iPhone 17 Pro: the zoo's coreai-audio app, 12 s of audio in ~1 s.*
## Use it
▶️ **Run it (source)**: the [Music runner](https://github.com/john-rocky/coreai-kit/tree/main/Examples/Music)
(GUI + CLI, one app for every text-to-music model in the catalog):
```bash
git clone https://github.com/john-rocky/coreai-kit
open coreai-kit/Examples/Music/Music.xcodeproj
# → Run, then pick "Stable Audio Open Small" in the model picker
# agents / headless (macOS):
cd coreai-kit/Examples/Music
swift run music-cli --model stable-audio-open-small --prompt "128 BPM tech house drum loop" --output loop.wav
```
💻 **Build with it**: complete; the glue is kit API, so it runs as pasted:
```swift
import CoreAIKit

// Any text prompt; this one matches the CLI example above.
let prompt = "128 BPM tech house drum loop"
let musician = try await KitMusician(catalog: "stable-audio-open-small")
let audio = try await musician.generate(prompt)
// audio.samples: 44.1 kHz stereo (planar L/R); play it or write a WAV
```
The take-home is [`Examples/Music/Sources/QuickStart.swift`](https://github.com/john-rocky/coreai-kit/blob/main/Examples/Music/Sources/QuickStart.swift):
this exact code as one typed function, no UI. The CLI is an argument shell over it, and
the GUI drives the same `KitMusician(catalog:)` and plays the result.
Length? Use `generate(_:seconds:)`, up to the model's ~11 s window. The WAV container is your
app's territory (the runner ships a 30-line writer with planar-stereo support).
**Integration checklist**
- SPM: `https://github.com/john-rocky/coreai-kit` → product **CoreAIKit**
- Info.plist: none needed
- Entitlements: none needed (macOS)
- First run downloads the model (1.1 GB on Mac), then it loads from the
local cache (Application Support; progress via the `downloadProgress` callback)
- Measure in Release: Debug is ~3× slower on per-token host work
<!-- gen-cards:use-it end -->
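Since the WAV container is left to the app, here is a minimal sketch of a writer for planar stereo. It assumes `audio.samples` is a flat `[Float]` holding the whole left channel followed by the whole right channel; that layout, and the helper names `appendLE` and `wavBytes`, are illustrative assumptions rather than kit API. It is not the runner's own writer.

```swift
/// Appends an integer in little-endian byte order, as RIFF/WAV requires.
func appendLE<T: FixedWidthInteger>(_ value: T, to bytes: inout [UInt8]) {
    withUnsafeBytes(of: value.littleEndian) { bytes.append(contentsOf: $0) }
}

/// Builds a 16-bit PCM stereo WAV from planar samples (all of L, then all of R).
/// Assumption: samples are finite and nominally in [-1, 1]; values outside are clamped.
func wavBytes(planarStereo samples: [Float], sampleRate: Int = 44_100) -> [UInt8] {
    let frames = samples.count / 2
    let channels = 2
    let bytesPerSample = 2
    let dataSize = frames * channels * bytesPerSample

    var bytes: [UInt8] = []
    bytes.reserveCapacity(44 + dataSize)

    // RIFF header
    bytes.append(contentsOf: "RIFF".utf8)
    appendLE(UInt32(36 + dataSize), to: &bytes)
    bytes.append(contentsOf: "WAVE".utf8)

    // "fmt " chunk: PCM, 2 channels, 16 bits per sample
    bytes.append(contentsOf: "fmt ".utf8)
    appendLE(UInt32(16), to: &bytes)
    appendLE(UInt16(1), to: &bytes)
    appendLE(UInt16(channels), to: &bytes)
    appendLE(UInt32(sampleRate), to: &bytes)
    appendLE(UInt32(sampleRate * channels * bytesPerSample), to: &bytes)
    appendLE(UInt16(channels * bytesPerSample), to: &bytes)
    appendLE(UInt16(8 * bytesPerSample), to: &bytes)

    // "data" chunk: WAV stores frames interleaved (L0 R0 L1 R1 ...)
    bytes.append(contentsOf: "data".utf8)
    appendLE(UInt32(dataSize), to: &bytes)
    for i in 0..<frames {
        for sample in [samples[i], samples[frames + i]] {
            let clamped = max(-1, min(1, sample))
            appendLE(Int16((clamped * 32767).rounded()), to: &bytes)
        }
    }
    return bytes
}
```

To save the result, wrap the bytes in Foundation's `Data` and call `write(to:)` with a file URL.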
## What's in the bundle (`macos/`)
Three Core AI `.aimodel` bundles + a tiny host sampler loop:
| bundle | role | I/O |
|---|---|---|
| `sa_cond_fp16b` | T5-base encoder + number conditioner | `input_ids[1,64], attention_mask[1,64], seconds_norm[1] → cross_attn_cond[1,65,768], global_embed[1,768], cond_mask[1,65]` |
| `sa_dit_fp16` | diffusion transformer (run 8×) | `x[1,64,256], t[1], cross_attn_cond, global_embed, cross_attn_cond_mask → v[1,64,256]` |
| `sa_vae_fp16` | Oobleck VAE decoder | `latent[1,64,256] → audio[1,2,524288]` |
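The shapes in the table fix the clip length. A small sketch of the arithmetic, assuming the last axis of `latent[1,64,256]` is time (the axis order is not stated in this card):

```swift
let sampleRate = 44_100.0   // Hz, from the card
let audioSamples = 524_288  // last axis of audio[1,2,524288]
let latentFrames = 256      // last axis of latent[1,64,256], assumed to be time

// Clip duration implied by the VAE output length.
let durationSeconds = Double(audioSamples) / sampleRate
// Audio samples produced per latent frame by the decoder.
let samplesPerLatentFrame = audioSamples / latentFrames

let rounded = (durationSeconds * 100).rounded() / 100
print("\(rounded) s, \(samplesPerLatentFrame) samples per latent frame")
// prints "11.89 s, 2048 samples per latent frame"
```

This is where the ~11.9 s figure in the performance table comes from.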
**Host loop** (`StableAudioRunner`): tokenize (T5, `t5_tokenizer/`) → conditioner → start from Gaussian
noise → 8-step rectified-flow Euler `x = x + (t_next − t)·v` over the fixed schedule
`[1.0, .9944, .9845, .9579, .8909, .7455, .5125, .2739] → 0` → VAE decode → 44.1 kHz stereo WAV.
No KV cache, no CFG (cfg_scale 1.0, since the model is ARC-distilled).
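The sampler step of the host loop can be sketched in a few lines. This is not the `StableAudioRunner` source: `eulerSample` and its `velocity` closure are hypothetical stand-ins, with the closure taking the place of the `sa_dit_fp16` call, and the toy check at the end uses a made-up constant velocity field only to show that the update lands where expected.

```swift
// Fixed 8-step schedule from the card, with the terminal 0 appended.
let schedule: [Float] = [1.0, 0.9944, 0.9845, 0.9579, 0.8909, 0.7455, 0.5125, 0.2739, 0.0]

/// One full rectified-flow Euler pass: x <- x + (tNext - t) * v(x, t).
/// `velocity` stands in for the DiT call (hypothetical signature).
func eulerSample(start: [Float],
                 schedule: [Float],
                 velocity: ([Float], Float) -> [Float]) -> [Float] {
    var x = start
    for i in 0..<(schedule.count - 1) {
        let t = schedule[i]
        let tNext = schedule[i + 1]
        let v = velocity(x, t)
        let dt = tNext - t  // negative: integration runs from t = 1 down to t = 0
        for j in x.indices {
            x[j] += dt * v[j]
        }
    }
    return x
}

// Toy check with made-up numbers: for a constant field v = noise - data,
// the straight path from noise (t = 1) to data (t = 0) is integrated exactly
// up to floating-point rounding, so the result should match `data`.
let data: [Float] = [0.25, -0.5, 1.0]
let noise: [Float] = [1.5, 0.3, -0.7]
let constantV = zip(noise, data).map { $0 - $1 }
let result = eulerSample(start: noise, schedule: schedule) { _, _ in constantV }
```

In the real pipeline `start` would be Gaussian noise shaped like `x[1,64,256]`, and the closure would also pass the conditioner outputs to the DiT.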
## Performance (M4 Max, GPU)
| metric | value |
|---|---|
| 8-step DiT | ~200 ms (25 ms/step) |
| VAE decode | ~185 ms |
| **total** | **~0.4 s for ~11.9 s of audio (~30× real-time)** |
| size | fp16, ~1.0 GB (DiT 651M + cond 210M + VAE 149M) |
Numerics: each bundle is engine-gated against the reference at cosine similarity ≥ 0.9999; the full pipeline reproduces the
reference audio exactly.
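For readers who want to run a similar check on their own outputs, a cosine-similarity gate might look like the sketch below. The function name, the flattening to `[Float]`, and the sample values are illustrative assumptions; the card does not describe how the zoo's own gate is implemented.

```swift
/// Cosine similarity between two equally sized tensors flattened to [Float].
/// Accumulates in Double to limit rounding error; returns NaN if either input is all zeros.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Double {
    precondition(a.count == b.count, "tensors must have the same element count")
    var dot = 0.0, normA = 0.0, normB = 0.0
    for (x, y) in zip(a, b) {
        let (dx, dy) = (Double(x), Double(y))
        dot += dx * dy
        normA += dx * dx
        normB += dy * dy
    }
    return dot / (normA.squareRoot() * normB.squareRoot())
}

// Made-up values standing in for a reference output and a converted-bundle output.
let reference: [Float] = [0.10, -0.25, 0.70, 0.05]
let converted: [Float] = [0.1001, -0.2499, 0.7002, 0.0499]
let passes = cosineSimilarity(reference, converted) >= 0.9999
```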
## Roadmap
- iPhone (h18p) build: bundles AOT-compile; device RTF pending
- int8 (further size cut)
- a music-generation tab in the zoo app
## Credits & license
A community **Core AI conversion**. All credit goes to **Stability AI** (and Arm) for
[Stable Audio Open Small](https://huggingface.co/stabilityai/stable-audio-open-small); T5 text encoder
by Google. This bundle is governed by the **[Stability AI Community License](https://huggingface.co/stabilityai/stable-audio-open-small/blob/main/LICENSE.md)**
(free for non-commercial use and for commercial use under \$1M annual revenue; review the license
before use). No retraining; conversion only.
Part of the [Core AI model zoo](https://github.com/john-rocky/coreai-model-zoo).