---
license: other
license_name: stability-ai-community-license
license_link: https://huggingface.co/stabilityai/stable-audio-3-small-sfx/blob/main/LICENSE.md
tags:
- audio
- text-to-audio
- sound-effects
- stable-audio
- onnx
- q4
- matmulnbits
- onnxruntime-web
- webgpu
library_name: onnxruntime
base_model: stabilityai/stable-audio-3-small-sfx
---

# stable-audio-3-small-sfx: ONNX (browser / WebGPU)

The **SFX-specific** ONNX graphs for [`stabilityai/stable-audio-3-small-sfx`](https://huggingface.co/stabilityai/stable-audio-3-small-sfx), for text-to-sound-effect generation in the browser via [`onnxruntime-web`](https://www.npmjs.com/package/onnxruntime-web).

This model shares its **text encoder** (T5Gemma) and **autoencoder/decoder** with the music model (those weights are bit-identical), so only the SFX-specific parts live here. Reuse the text encoder and decoder from [`lsb/stable-audio-3-small-music-onnx`](https://huggingface.co/lsb/stable-audio-3-small-music-onnx).

## Files

```
dit_q4.onnx + dit_q4.data   SFX diffusion transformer, int4 MatMulNBits (~320 MB)
number_conditioner.onnx     SFX duration embedder
padding_embedding.json      SFX prompt padding vector (768 floats, see below)
```

Linear/MatMul weights are int4 `MatMulNBits` (block_size 32); everything else (Conv, norms, the embedding `Gather`) stays fp32, so every op has a WebGPU kernel and the graph runs on the `webgpu` execution provider (with `wasm` fallback).
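As a minimal sketch of the loading setup described above, the helper below assembles session options for the DiT. It assumes an `onnxruntime-web` build that supports the `webgpu` execution provider and the `externalData` session option, and it assumes the graph references its weights under the file name `dit_q4.data`. The helper name `ditSessionOptions`, the `baseUrl` parameter, and the local interfaces are hypothetical; the interfaces mirror only the two option fields used here so the snippet stands alone.

```typescript
// Hypothetical helper, not part of onnxruntime-web.
// Local subset of the session options, declared here so the sketch
// type-checks without importing the library.
interface ExternalDataRef {
  path: string; // file name the graph uses for its external weights (assumed)
  data: string; // URL the weight bytes are fetched from
}

interface DitSessionOptions {
  executionProviders: string[];
  externalData: ExternalDataRef[];
}

function ditSessionOptions(baseUrl: string): DitSessionOptions {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    // Listed in preference order: WebGPU first, wasm as the fallback.
    executionProviders: ["webgpu", "wasm"],
    // dit_q4.onnx keeps its weights in the sibling file dit_q4.data.
    externalData: [{ path: "dit_q4.data", data: `${root}/dit_q4.data` }],
  };
}
```

With `onnxruntime-web` imported as `ort`, the returned object would be passed as the second argument to `ort.InferenceSession.create(modelUrl, options)`, where `modelUrl` points at `dit_q4.onnx`.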
## Pipeline (identical to the music model)

```
tokens  → text_encoder(lsb) → [override pad rows with padding_embedding]
seconds → number_conditioner → duration token
cross = [text(256); duration(1)]  (1,257,768);  global = duration (1,768)
pingpong sampler, 8 steps, LogSNRShift(rate=0, anchor_logsnr=-6.2, logsnr_end=2.0), CFG off
latents → decoder(lsb) → stereo 44.1 kHz
```

### DiT I/O

```
x                float32 (1, 256, t_lat)
t                float32 (1,)
cross_attn_cond  float32 (1, 257, 768)
global_embed     float32 (1, 768)
local_add_cond   float32 (1, 257, t_lat)   zeros for plain text-to-audio
padding_mask     bool    (1, t_lat)        all-true
→ out            float32 (1, 256, t_lat)
```

`t_lat = ceil((seconds + 6) · 44100 / 8192) · 2`.

### padding_embedding

The prompt conditioner replaces padded token positions (where `attention_mask == 0`) with a learned vector. That vector differs between the SFX and music models, so after running the (shared) text encoder, overwrite the padded rows of `last_hidden_state` with this 768-vector.

## License

Inherits the **Stability AI Community License**; the T5Gemma text encoder additionally falls under Google's Gemma Terms of Use.
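
## Sketch: latent length and padding override

The `t_lat` formula and the padding override described under DiT I/O and padding_embedding can be sketched in TypeScript as follows. This is an illustration under stated assumptions, not the reference implementation: the function names `latentLength` and `overridePadRows` are hypothetical, and the hidden states are assumed to be a flat, row-major `Float32Array` of shape `(seq_len, dim)` with the batch dimension of 1 dropped.

```typescript
// t_lat = ceil((seconds + 6) * 44100 / 8192) * 2, as given under DiT I/O.
function latentLength(seconds: number): number {
  return Math.ceil(((seconds + 6) * 44100) / 8192) * 2;
}

// Overwrites, in place, every row of `hidden` whose attention_mask entry is 0
// with the padding vector (the contents of padding_embedding.json).
// `mask` may hold numbers or bigints, since int64 tensors surface as bigint.
function overridePadRows(
  hidden: Float32Array,
  mask: ArrayLike<number | bigint>,
  padding: ArrayLike<number>,
): Float32Array {
  const dim = padding.length;
  if (hidden.length !== mask.length * dim) {
    throw new RangeError("hidden must hold mask.length rows of padding.length values");
  }
  for (let row = 0; row < mask.length; row++) {
    if (Number(mask[row]) === 0) {
      hidden.set(padding, row * dim); // copy the vector into this padded row
    }
  }
  return hidden;
}
```

For a 10-second clip, `latentLength(10)` evaluates to 174, which would be the size of the last axis of `x`, `local_add_cond`, and `padding_mask`.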