---
license: other
license_name: stability-ai-community-license
license_link: https://huggingface.co/stabilityai/stable-audio-3-small-sfx/blob/main/LICENSE.md
tags:
- audio
- text-to-audio
- sound-effects
- stable-audio
- onnx
- q4
- matmulnbits
- onnxruntime-web
- webgpu
library_name: onnxruntime
base_model: stabilityai/stable-audio-3-small-sfx
---
# stable-audio-3-small-sfx: ONNX (browser / WebGPU)
The **SFX-specific** ONNX graphs for [`stabilityai/stable-audio-3-small-sfx`](https://huggingface.co/stabilityai/stable-audio-3-small-sfx),
for text-to-sound-effect generation in the browser via [`onnxruntime-web`](https://www.npmjs.com/package/onnxruntime-web).
This model shares its **text encoder** (T5Gemma) and **autoencoder/decoder** with the music
model, and those weights are bit-identical, so only the SFX-specific parts live here. Reuse the
text encoder and decoder from
[`lsb/stable-audio-3-small-music-onnx`](https://huggingface.co/lsb/stable-audio-3-small-music-onnx).
## Files
```
dit_q4.onnx + dit_q4.data   SFX diffusion transformer, int4 MatMulNBits (~320 MB)
number_conditioner.onnx     SFX duration embedder
padding_embedding.json      SFX prompt padding vector (768 floats, see below)
```
Linear/MatMul weights are int4 `MatMulNBits` (block_size 32); everything else (Conv, norms,
the embedding `Gather`) stays fp32, so every op has a WebGPU kernel and the graph runs on the
`webgpu` execution provider (with `wasm` fallback).
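As a rough sketch of the provider order described above, the helper below builds a session-options object that could be passed to `ort.InferenceSession.create(...)` in `onnxruntime-web`. The function name `ditSessionOptions` and the `baseUrl` parameter are hypothetical, and it assumes the graph references its external weights under the name `dit_q4.data`; adjust the path if the graph was exported differently.

```typescript
// One external-weights entry for the onnxruntime-web `externalData` option.
interface ExternalDataEntry {
  path: string; // name the .onnx graph uses for its weights file (assumed here)
  data: string; // URL the weights are fetched from
}

interface DitSessionOptions {
  executionProviders: string[];
  externalData: ExternalDataEntry[];
}

// Hypothetical helper: `baseUrl` is wherever this repo's files are hosted.
function ditSessionOptions(baseUrl: string): DitSessionOptions {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    // Try WebGPU first, then fall back to the WASM provider.
    executionProviders: ["webgpu", "wasm"],
    externalData: [{ path: "dit_q4.data", data: `${root}/dit_q4.data` }],
  };
}
```

The resulting object would be used as the second argument to `ort.InferenceSession.create`, with `${baseUrl}/dit_q4.onnx` as the first.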
## Pipeline (identical to the music model)
```
tokens  → text_encoder(lsb) → [override pad rows with padding_embedding]
seconds → number_conditioner → duration token
cross = [text(256); duration(1)] (1,257,768); global = duration (1,768)
pingpong sampler, 8 steps, LogSNRShift(rate=0, anchor_logsnr=-6.2, logsnr_end=2.0), CFG off
latents → decoder(lsb) → stereo 44.1 kHz
```
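The conditioning step in the pipeline above (`cross = [text(256); duration(1)]`, `global = duration`) can be sketched as a plain buffer concatenation. This is a minimal illustration, assuming both tensors have already been read back as row-major `Float32Array`s with the batch dimension of 1 flattened away; `buildConditioning` and the `Conditioning` interface are hypothetical names.

```typescript
const TEXT_TOKENS = 256; // prompt length after padding
const EMBED_DIM = 768;   // conditioning width

interface Conditioning {
  crossAttnCond: Float32Array; // logical shape (1, 257, 768)
  globalEmbed: Float32Array;   // logical shape (1, 768)
}

// `text` is the (256, 768) encoder output, `duration` the (768,) duration token.
function buildConditioning(text: Float32Array, duration: Float32Array): Conditioning {
  if (text.length !== TEXT_TOKENS * EMBED_DIM) {
    throw new Error("text must hold 256 x 768 floats");
  }
  if (duration.length !== EMBED_DIM) {
    throw new Error("duration must hold 768 floats");
  }
  const crossAttnCond = new Float32Array((TEXT_TOKENS + 1) * EMBED_DIM);
  crossAttnCond.set(text, 0);                         // rows 0..255: text
  crossAttnCond.set(duration, TEXT_TOKENS * EMBED_DIM); // row 256: duration
  return { crossAttnCond, globalEmbed: duration.slice() };
}
```

The two buffers map onto the `cross_attn_cond` and `global_embed` inputs of the DiT.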
### DiT I/O
```
x                float32 (1, 256, t_lat)
t                float32 (1,)
cross_attn_cond  float32 (1, 257, 768)
global_embed     float32 (1, 768)
local_add_cond   float32 (1, 257, t_lat)   zeros for plain text-to-audio
padding_mask     bool    (1, t_lat)        all-true
→ out            float32 (1, 256, t_lat)
```
`t_lat = ceil((seconds + 6) · 44100 / 8192) · 2`.
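To make the formula and the constant inputs concrete, here is a small sketch that computes `t_lat` and allocates the two inputs that stay fixed for plain text-to-audio. The names `latentLength` and `constantDitInputs` are hypothetical. The mask is stored as a `Uint8Array` of ones on the assumption that this is how `bool` tensors are backed in `onnxruntime-web`.

```typescript
// t_lat = ceil((seconds + 6) * 44100 / 8192) * 2
function latentLength(seconds: number): number {
  return Math.ceil(((seconds + 6) * 44100) / 8192) * 2;
}

interface ConstantDitInputs {
  localAddCond: Float32Array; // logical shape (1, 257, t_lat), all zeros
  paddingMask: Uint8Array;    // logical shape (1, t_lat), all true
}

function constantDitInputs(tLat: number): ConstantDitInputs {
  return {
    // Float32Array is zero-initialised, which is what local_add_cond needs.
    localAddCond: new Float32Array(257 * tLat),
    // 1 means "true" for every latent position.
    paddingMask: new Uint8Array(tLat).fill(1),
  };
}
```

For example, a 10-second request gives `t_lat = ceil(16 · 44100 / 8192) · 2 = 87 · 2 = 174`, so `x` and `out` have shape `(1, 256, 174)`.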
### padding_embedding
The prompt conditioner replaces padded token positions (where `attention_mask == 0`) with a
learned vector. That vector differs between the SFX and music models, so after running the
(shared) text encoder, overwrite the padded rows of `last_hidden_state` with this 768-vector.
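The override described above might look like the following sketch. It assumes `last_hidden_state` has been read back as a row-major `Float32Array` of shape `(tokens, 768)`, that `attention_mask` is available as an array of 0/1 values (numbers or bigints), and that `padding_embedding.json` has been parsed into a flat array of 768 numbers. The JSON layout is not documented here, so check it before relying on this. `overridePaddedRows` is a hypothetical name.

```typescript
// Returns a copy of `hidden` in which every row whose mask entry is 0
// has been replaced by `paddingEmbedding`.
function overridePaddedRows(
  hidden: Float32Array,
  attentionMask: ArrayLike<number | bigint>,
  paddingEmbedding: ArrayLike<number>,
): Float32Array {
  const dim = paddingEmbedding.length;
  if (hidden.length !== attentionMask.length * dim) {
    throw new Error("hidden must hold attentionMask.length rows of paddingEmbedding.length floats");
  }
  const out = hidden.slice(); // leave the encoder output untouched
  for (let i = 0; i < attentionMask.length; i++) {
    if (Number(attentionMask[i]) === 0) {
      out.set(paddingEmbedding, i * dim); // overwrite row i
    }
  }
  return out;
}
```

Rows with `attention_mask == 1` pass through unchanged, so the same helper works for prompts of any length up to the padded size.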
## License
Inherits the **Stability AI Community License**; the T5Gemma text encoder additionally falls
under Google's Gemma Terms of Use.