license: other
license_name: stability-ai-community-license
license_link: >-
https://huggingface.co/stabilityai/stable-audio-3-small-sfx/blob/main/LICENSE.md
tags:
- audio
- text-to-audio
- sound-effects
- stable-audio
- onnx
- q4
- matmulnbits
- onnxruntime-web
- webgpu
library_name: onnxruntime
base_model: stabilityai/stable-audio-3-small-sfx
stable-audio-3-small-sfx: ONNX (browser / WebGPU)
The SFX-specific ONNX graphs for stabilityai/stable-audio-3-small-sfx,
for text-to-sound-effect generation in the browser via onnxruntime-web.
This model shares its text encoder (T5Gemma) and autoencoder/decoder with the music
model (those weights are bit-identical), so only the SFX-specific parts live here. Reuse the
text encoder and decoder from
lsb/stable-audio-3-small-music-onnx.
Files
dit_q4.onnx + dit_q4.data: SFX diffusion transformer, int4 MatMulNBits (~320 MB)
number_conditioner.onnx: SFX duration embedder
padding_embedding.json: SFX prompt padding vector (768 floats, see below)
Linear/MatMul weights are int4 MatMulNBits (block_size 32); everything else (Conv, norms,
the embedding Gather) stays fp32, so every op has a WebGPU kernel and the graph runs on the
webgpu execution provider (with wasm fallback).
Pipeline (identical to the music model)
tokens → text_encoder(lsb) → [override pad rows with padding_embedding]
seconds → number_conditioner → duration token
cross = [text(256); duration(1)] (1,257,768); global = duration (1,768)
pingpong sampler, 8 steps, LogSNRShift(rate=0, anchor_logsnr=-6.2, logsnr_end=2.0), CFG off
latents → decoder(lsb) → stereo 44.1 kHz
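The conditioning step in the pipeline above (256 text rows followed by one duration row, with the duration row reused as the global embedding) can be sketched in plain Python. This is an illustrative sketch only: the function name `assemble_conditioning` and the nested-list representation are assumptions made for readability, the batch dimension of 1 is left implicit, and in the browser the same concatenation would be done on typed arrays before building tensors.

```python
from typing import List, Tuple

Matrix = List[List[float]]  # row-major: one inner list per token

TEXT_TOKENS = 256  # rows coming out of the shared text encoder
EMBED_DIM = 768    # width of every conditioning vector


def assemble_conditioning(
    text_rows: Matrix, duration_row: List[float]
) -> Tuple[Matrix, List[float]]:
    """Build (cross, global) for a single prompt (hypothetical helper).

    cross  = text rows followed by the duration row -> 257 x 768
    global = the duration row on its own            -> 768
    """
    if len(text_rows) != TEXT_TOKENS:
        raise ValueError(f"expected {TEXT_TOKENS} text rows, got {len(text_rows)}")
    for row in text_rows + [duration_row]:
        if len(row) != EMBED_DIM:
            raise ValueError(f"expected rows of width {EMBED_DIM}, got {len(row)}")

    # Copy the rows so the caller's buffers are not aliased.
    cross = [list(row) for row in text_rows] + [list(duration_row)]
    global_embed = list(duration_row)
    return cross, global_embed
```

The text rows passed in are expected to already have their padded positions overwritten with the SFX padding vector, as the first pipeline line indicates.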
DiT I/O
x float32 (1, 256, t_lat)
t float32 (1,)
cross_attn_cond float32 (1, 257, 768)
global_embed float32 (1, 768)
local_add_cond float32 (1, 257, t_lat) zeros for plain text-to-audio
padding_mask bool (1, t_lat) all-true
→ out float32 (1, 256, t_lat)
t_lat = ceil((seconds + 6) · 44100 / 8192) · 2.
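As a worked example of the formula and the I/O list above, the sketch below computes t_lat and the input shapes the DiT expects for a given duration. The helper names `latent_length` and `dit_input_shapes` are hypothetical, and the constants 6, 44100 and 8192 are copied verbatim from the formula without assigning them any meaning beyond what is stated here.

```python
import math
from typing import Dict, Tuple

Spec = Tuple[str, Tuple[int, ...]]  # (dtype name, shape)


def latent_length(seconds: float) -> int:
    """t_lat = ceil((seconds + 6) * 44100 / 8192) * 2, as given above."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return math.ceil((seconds + 6) * 44100 / 8192) * 2


def dit_input_shapes(seconds: float) -> Dict[str, Spec]:
    """Expected dtype and shape of each DiT input for a requested duration."""
    t_lat = latent_length(seconds)
    return {
        "x": ("float32", (1, 256, t_lat)),
        "t": ("float32", (1,)),
        "cross_attn_cond": ("float32", (1, 257, 768)),
        "global_embed": ("float32", (1, 768)),
        # All zeros for plain text-to-audio.
        "local_add_cond": ("float32", (1, 257, t_lat)),
        # All true.
        "padding_mask": ("bool", (1, t_lat)),
    }
```

For a 10-second request, `latent_length(10)` evaluates to ceil(16 · 44100 / 8192) · 2 = 87 · 2 = 174, so `x` has shape (1, 256, 174). A table like this can be used to validate buffers before they are handed to the runtime.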
padding_embedding
The prompt conditioner replaces padded token positions (where attention_mask == 0) with a
learned vector. That vector differs between the SFX and music models, so after running the
(shared) text encoder, overwrite the padded rows of last_hidden_state with this 768-vector.
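A minimal sketch of that overwrite, written in plain Python for clarity. The function name `override_padded_rows` is hypothetical, and the vector is passed in as an already-parsed list because the exact layout of padding_embedding.json is not described here beyond "768 floats".

```python
from typing import List, Sequence


def override_padded_rows(
    last_hidden_state: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
    padding_embedding: Sequence[float],
) -> List[List[float]]:
    """Return a copy of last_hidden_state with padded rows replaced.

    last_hidden_state: one row per token position (batch dimension omitted).
    attention_mask:    1 for real tokens, 0 for padding.
    padding_embedding: the learned SFX padding vector.
    """
    if len(last_hidden_state) != len(attention_mask):
        raise ValueError("mask length must match the number of token rows")

    out: List[List[float]] = []
    for row, keep in zip(last_hidden_state, attention_mask):
        if len(row) != len(padding_embedding):
            raise ValueError("row width must match the padding vector")
        # Keep real-token rows; substitute the learned vector where mask == 0.
        out.append(list(row) if keep else list(padding_embedding))
    return out
```

The function returns a new list rather than mutating its input, so the encoder output can be reused with a different padding vector (for example the music model's) without re-running the text encoder.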
License
Inherits the Stability AI Community License; the T5Gemma text encoder additionally falls under Google's Gemma Terms of Use.