+---
+license: other
+license_name: stability-ai-community-license
+license_link: https://huggingface.co/stabilityai/stable-audio-3-small-sfx/blob/main/LICENSE.md
+tags:
+  - audio
+  - text-to-audio
+  - sound-effects
+  - stable-audio
+  - onnx
+  - q4
+  - matmulnbits
+  - onnxruntime-web
+  - webgpu
+library_name: onnxruntime
+base_model: stabilityai/stable-audio-3-small-sfx
+---
+# stable-audio-3-small-sfx — ONNX (browser / WebGPU)
+The **SFX-specific** ONNX graphs for [`stabilityai/stable-audio-3-small-sfx`](https://huggingface.co/stabilityai/stable-audio-3-small-sfx),
+for text-to-sound-effect generation in the browser via [`onnxruntime-web`](https://www.npmjs.com/package/onnxruntime-web).
+This model shares its **text encoder** (T5Gemma) and **autoencoder/decoder** with the music
+model, and those weights are bit-identical, so this repository holds only the SFX-specific
+parts. Load the text encoder and decoder from
+[`lsb/stable-audio-3-small-music-onnx`](https://huggingface.co/lsb/stable-audio-3-small-music-onnx).
+## Files
+```
+dit_q4.onnx + dit_q4.data     SFX diffusion transformer, int4 MatMulNBits (~320 MB)
+number_conditioner.onnx       SFX duration embedder
+padding_embedding.json        SFX prompt padding vector (768 floats, see below)
+```
+Linear/MatMul weights are int4 `MatMulNBits` (block_size 32); everything else (Conv, norms,
+the embedding `Gather`) stays fp32. As a result, every op has a WebGPU kernel and the graph
+runs on the `webgpu` execution provider, with `wasm` as a fallback.
+## Pipeline (identical to the music model)
+```
+tokens → text_encoder(lsb) → [override pad rows with padding_embedding]
+seconds → number_conditioner → duration token
+cross = [text(256); duration(1)] (1,257,768);  global = duration (1,768)
+pingpong sampler, 8 steps, LogSNRShift(rate=0, anchor_logsnr=-6.2, logsnr_end=2.0), CFG off
+latents → decoder(lsb) → stereo 44.1 kHz
+```
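The conditioning step in the pipeline above can be sketched as follows. This is a minimal illustration, not code from this repository: `buildConditioning` is a hypothetical helper, and it assumes the text encoder output and the duration token arrive as flat, row-major `Float32Array` buffers of 256 × 768 and 768 values respectively.

```typescript
// Hypothetical helper (not part of this repo): assembles the two conditioning
// tensors named in the pipeline. Row-major flat buffers are an assumption.
const COND_TEXT_TOKENS = 256; // text rows in cross_attn_cond
const COND_DIM = 768;         // embedding width

interface Conditioning {
  cross: Float32Array;  // logical shape (1, 257, 768)
  global: Float32Array; // logical shape (1, 768)
}

function buildConditioning(text: Float32Array, duration: Float32Array): Conditioning {
  if (text.length !== COND_TEXT_TOKENS * COND_DIM) {
    throw new Error(`expected ${COND_TEXT_TOKENS * COND_DIM} text values, got ${text.length}`);
  }
  if (duration.length !== COND_DIM) {
    throw new Error(`expected ${COND_DIM} duration values, got ${duration.length}`);
  }
  const cross = new Float32Array((COND_TEXT_TOKENS + 1) * COND_DIM);
  cross.set(text, 0);                            // rows 0..255: text embeddings
  cross.set(duration, COND_TEXT_TOKENS * COND_DIM); // row 256: duration token
  // The global embedding is a copy of the same duration token.
  return { cross, global: new Float32Array(duration) };
}
```

The text rows passed in here should already have their padded positions overwritten with the SFX padding vector.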
+### DiT I/O
+```
+x               float32 (1, 256, t_lat)
+t               float32 (1,)
+cross_attn_cond float32 (1, 257, 768)
+global_embed    float32 (1, 768)
+local_add_cond  float32 (1, 257, t_lat)   zeros for plain text-to-audio
+padding_mask    bool    (1, t_lat)        all-true
+→ out           float32 (1, 256, t_lat)
+```
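For plain text-to-audio, two of these inputs are constants that depend only on `t_lat`. The sketch below builds them as raw buffers plus dims; `makeDitAuxInputs` is a hypothetical name, and the choice of `Uint8Array` for the bool mask reflects my understanding of how `onnxruntime-web` backs `bool` tensors, so treat it as an assumption to verify.

```typescript
// Hypothetical helper (not part of this repo): constant DiT inputs for plain
// text-to-audio, with shapes copied from the I/O table above.
interface RawTensor<T> {
  data: T;
  dims: number[];
}

interface DitAuxInputs {
  localAddCond: RawTensor<Float32Array>; // (1, 257, t_lat), all zeros
  paddingMask: RawTensor<Uint8Array>;    // (1, t_lat), all true
}

function makeDitAuxInputs(tLat: number): DitAuxInputs {
  if (!Number.isInteger(tLat) || tLat <= 0) {
    throw new Error(`t_lat must be a positive integer, got ${tLat}`);
  }
  return {
    // Float32Array is zero-initialised, which matches "zeros" in the table.
    localAddCond: { data: new Float32Array(257 * tLat), dims: [1, 257, tLat] },
    // Assumption: bool tensors are passed as bytes, 1 meaning true.
    paddingMask: { data: new Uint8Array(tLat).fill(1), dims: [1, tLat] },
  };
}
```

Each `data`/`dims` pair can then be wrapped in a tensor of the matching type (`float32` or `bool`) before being passed to the session.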
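+The latent length is `t_lat = ceil((seconds + 6) · 44100 / 8192) · 2`, where `seconds` is the
+requested duration that is also fed to `number_conditioner`.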
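As a worked instance of the formula, a direct transcription might look like this. `latentLength` is a hypothetical name, and the input validation is an addition for illustration; the constants are copied verbatim from the formula.

```typescript
// Hypothetical helper (not part of this repo): direct transcription of
// t_lat = ceil((seconds + 6) * 44100 / 8192) * 2.
function latentLength(seconds: number): number {
  if (!Number.isFinite(seconds) || seconds < 0) {
    throw new Error(`seconds must be a non-negative number, got ${seconds}`);
  }
  // The ceiling is applied before doubling, so the result is always even.
  return Math.ceil(((seconds + 6) * 44100) / 8192) * 2;
}

// Example: 10 s gives (16 * 44100) / 8192 ≈ 86.13, ceil 87, doubled 174.
const tLatFor10s = latentLength(10);
```

For a 10-second request, `x`, `local_add_cond`, and `padding_mask` would therefore use `t_lat = 174`.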
+### padding_embedding
+The prompt conditioner replaces padded token positions (where `attention_mask == 0`) with a
+learned vector. That vector differs between the SFX and music models, so after running the
+(shared) text encoder, overwrite the padded rows of `last_hidden_state` with the 768-float
+vector stored in `padding_embedding.json`.
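The overwrite can be sketched as below. `overridePadRows` is a hypothetical helper, not code from this repository. It assumes `last_hidden_state` is a flat, row-major buffer of `tokens × 768` values and that the padding vector has already been parsed from the JSON file into a plain array; the mask may hold numbers or bigints, since token masks are often int64.

```typescript
// Hypothetical helper (not part of this repo): returns a copy of the encoder
// output with every padded row replaced by the SFX padding vector.
function overridePadRows(
  hidden: Float32Array,
  attentionMask: ArrayLike<number | bigint>,
  padEmbedding: ArrayLike<number>,
): Float32Array {
  const dim = padEmbedding.length; // 768 for this model
  const tokens = attentionMask.length;
  if (hidden.length !== tokens * dim) {
    throw new Error(`expected ${tokens * dim} hidden values, got ${hidden.length}`);
  }
  const out = new Float32Array(hidden); // copy, so the input stays untouched
  for (let i = 0; i < tokens; i++) {
    // Number() handles both number and bigint mask entries.
    if (Number(attentionMask[i]) === 0) {
      out.set(padEmbedding, i * dim); // overwrite row i
    }
  }
  return out;
}
```

Returning a copy keeps the shared encoder output reusable, for example when the same prompt is also run through the music model with its own padding vector.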
+## License
+Inherits the **Stability AI Community License**; the T5Gemma text encoder additionally falls
+under Google's Gemma Terms of Use.