| --- |
| license: other |
| license_name: stability-ai-community-license |
| license_link: https://huggingface.co/stabilityai/stable-audio-3-small-sfx/blob/main/LICENSE.md |
| tags: |
| - audio |
| - text-to-audio |
| - sound-effects |
| - stable-audio |
| - onnx |
| - q4 |
| - matmulnbits |
| - onnxruntime-web |
| - webgpu |
| library_name: onnxruntime |
| base_model: stabilityai/stable-audio-3-small-sfx |
| --- |
| |
| # stable-audio-3-small-sfx: ONNX (browser / WebGPU) |
|
|
| This repository holds the **SFX-specific** ONNX graphs for [`stabilityai/stable-audio-3-small-sfx`](https://huggingface.co/stabilityai/stable-audio-3-small-sfx),
| packaged for text-to-sound-effect generation in the browser via [`onnxruntime-web`](https://www.npmjs.com/package/onnxruntime-web).
|
|
| This model shares its **text encoder** (T5Gemma) and **autoencoder/decoder** with the music
| model (those weights are bit-identical), so only the SFX-specific parts live here. Reuse the
| text encoder and decoder from
| [`lsb/stable-audio-3-small-music-onnx`](https://huggingface.co/lsb/stable-audio-3-small-music-onnx).
|
|
| ## Files |
|
|
| ``` |
| dit_q4.onnx + dit_q4.data SFX diffusion transformer, int4 MatMulNBits (~320 MB) |
| number_conditioner.onnx SFX duration embedder |
| padding_embedding.json SFX prompt padding vector (768 floats, see below) |
| ``` |
|
|
| Linear/MatMul weights are int4 `MatMulNBits` (block_size 32); everything else (Conv, norms,
| the embedding `Gather`) stays fp32. As a result, every op has a WebGPU kernel and the graph runs on the
| `webgpu` execution provider (with `wasm` fallback).
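
A minimal sketch of the session options this implies, assuming `onnxruntime-web`'s `executionProviders` and `externalData` session options. The helper `ditSessionOptions` and its `baseUrl` argument are hypothetical, and the external-data `path` is assumed to match the file name recorded inside `dit_q4.onnx`.

```typescript
// Hypothetical helper (not part of this repo): builds the options object one
// would pass to ort.InferenceSession.create() for the quantized DiT.
interface DitSessionOptions {
  executionProviders: string[];
  externalData: { path: string; data: string }[];
}

function ditSessionOptions(baseUrl: string): DitSessionOptions {
  // Drop trailing slashes so the joined URLs stay clean.
  const root = baseUrl.replace(/\/+$/, "");
  return {
    // Providers are tried in order: WebGPU first, then the wasm fallback.
    executionProviders: ["webgpu", "wasm"],
    // Assumption: the graph refers to its weights by the name "dit_q4.data".
    externalData: [{ path: "dit_q4.data", data: `${root}/dit_q4.data` }],
  };
}
```

With the library loaded as `ort`, the result would be passed as the second argument of `ort.InferenceSession.create(...)`, alongside the URL of `dit_q4.onnx`.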
| |
| ## Pipeline (identical to the music model) |
| |
| ``` |
| tokens → text_encoder(lsb) → [override pad rows with padding_embedding]
| seconds → number_conditioner → duration token
| cross = [text(256); duration(1)] (1,257,768); global = duration (1,768)
| pingpong sampler, 8 steps, LogSNRShift(rate=0, anchor_logsnr=-6.2, logsnr_end=2.0), CFG off
| latents → decoder(lsb) → stereo 44.1 kHz
| ``` |
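
The `cross` / `global` step can be sketched on plain typed arrays as below. This assumes the text encoder output is available as a flat row-major `Float32Array` of 256 × 768 values and the duration token as 768 values; `buildConditioning` is an illustrative name, not something shipped in this repo.

```typescript
const TEXT_TOKENS = 256; // prompt rows, per the pipeline above
const EMBED_DIM = 768;   // embedding width, per the pipeline above

/** Stack the text rows and the duration token into the two conditioning inputs. */
function buildConditioning(text: Float32Array, duration: Float32Array) {
  if (text.length !== TEXT_TOKENS * EMBED_DIM) {
    throw new RangeError("text must hold 256 x 768 floats");
  }
  if (duration.length !== EMBED_DIM) {
    throw new RangeError("duration must hold 768 floats");
  }
  // cross = [text(256); duration(1)], i.e. the duration token is the last row.
  const cross = new Float32Array((TEXT_TOKENS + 1) * EMBED_DIM);
  cross.set(text, 0);
  cross.set(duration, TEXT_TOKENS * EMBED_DIM);
  return {
    cross: { data: cross, dims: [1, TEXT_TOKENS + 1, EMBED_DIM] },
    global: { data: duration.slice(), dims: [1, EMBED_DIM] },
  };
}
```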
| |
| ### DiT I/O |
| |
| ``` |
| x float32 (1, 256, t_lat) |
| t float32 (1,) |
| cross_attn_cond float32 (1, 257, 768) |
| global_embed float32 (1, 768) |
| local_add_cond float32 (1, 257, t_lat) zeros for plain text-to-audio |
| padding_mask bool (1, t_lat) all-true |
| → out float32 (1, 256, t_lat)
| ``` |
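
For plain text-to-audio, two of these inputs are constants that depend only on `t_lat`. The sketch below builds them as plain data; `constantDitInputs` and `TensorSpec` are hypothetical names. With `onnxruntime-web`, each spec would be wrapped as `new ort.Tensor(spec.type, spec.data, spec.dims)`, where `bool` tensors are backed by a `Uint8Array`.

```typescript
const CROSS_TOKENS = 257; // second dimension of local_add_cond in the table above

interface TensorSpec {
  type: "float32" | "bool";
  data: Float32Array | Uint8Array;
  dims: number[];
}

/** Build the zero local_add_cond and the all-true padding_mask for a given t_lat. */
function constantDitInputs(tLat: number): { local_add_cond: TensorSpec; padding_mask: TensorSpec } {
  if (!Number.isInteger(tLat) || tLat <= 0) {
    throw new RangeError("tLat must be a positive integer");
  }
  return {
    // Typed arrays start zero-filled, which is what plain text-to-audio needs.
    local_add_cond: {
      type: "float32",
      data: new Float32Array(CROSS_TOKENS * tLat),
      dims: [1, CROSS_TOKENS, tLat],
    },
    // 1 encodes "true" for every latent position.
    padding_mask: {
      type: "bool",
      data: new Uint8Array(tLat).fill(1),
      dims: [1, tLat],
    },
  };
}
```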
| |
| `t_lat = ceil((seconds + 6) · 44100 / 8192) · 2`.
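
As a worked instance of the formula, a minimal helper (the name `latentLength` is illustrative) could look like this:

```typescript
const SAMPLE_RATE = 44100; // Hz, from the formula above
const CHUNK = 8192;        // divisor from the formula above
const PAD_SECONDS = 6;     // extra seconds added by the formula above

/** Latent length t_lat for a requested clip duration in seconds. */
function latentLength(seconds: number): number {
  if (!Number.isFinite(seconds) || seconds < 0) {
    throw new RangeError("seconds must be a non-negative finite number");
  }
  return Math.ceil(((seconds + PAD_SECONDS) * SAMPLE_RATE) / CHUNK) * 2;
}
```

For a 10-second request, `(10 + 6) · 44100 / 8192 ≈ 86.13`, which rounds up to 87 and doubles to `t_lat = 174`, so `x` has shape `(1, 256, 174)`.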
| |
| ### padding_embedding |
| |
| The prompt conditioner replaces padded token positions (where `attention_mask == 0`) with a |
| learned vector. That vector differs between the SFX and music models, so after running the |
| (shared) text encoder, overwrite the padded rows of `last_hidden_state` with this 768-vector. |
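
A minimal sketch of that overwrite, assuming `last_hidden_state` is a flat row-major `Float32Array` of `tokens × 768` values and that `padding_embedding.json` parses to a flat array of 768 numbers (the exact JSON layout is an assumption). `applyPaddingEmbedding` is a hypothetical helper name.

```typescript
const EMBED_DIM = 768; // width of each hidden-state row

/** Return a copy of `hidden` with every padded row replaced by `padVector`. */
function applyPaddingEmbedding(
  hidden: Float32Array,
  attentionMask: ArrayLike<number | bigint>,
  padVector: ArrayLike<number>,
): Float32Array {
  const tokens = attentionMask.length;
  if (hidden.length !== tokens * EMBED_DIM) {
    throw new RangeError("hidden must hold tokens x 768 floats");
  }
  if (padVector.length !== EMBED_DIM) {
    throw new RangeError("padVector must hold 768 floats");
  }
  const out = hidden.slice(); // leave the encoder output untouched
  for (let row = 0; row < tokens; row++) {
    // Number() accepts both int32 masks and the bigint values of int64 masks.
    if (Number(attentionMask[row]) === 0) {
      out.set(padVector, row * EMBED_DIM);
    }
  }
  return out;
}
```

Returning a copy keeps the shared text encoder output reusable, for example if the same prompt is also fed to the music model with its own padding vector.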
| |
| ## License |
| |
| Inherits the **Stability AI Community License**; the T5Gemma text encoder additionally falls |
| under Google's Gemma Terms of Use. |
| |