license: other
license_name: stability-ai-community-license
license_link: >-
  https://huggingface.co/stabilityai/stable-audio-3-small-sfx/blob/main/LICENSE.md
tags:
  - audio
  - text-to-audio
  - sound-effects
  - stable-audio
  - onnx
  - q4
  - matmulnbits
  - onnxruntime-web
  - webgpu
library_name: onnxruntime
base_model: stabilityai/stable-audio-3-small-sfx

stable-audio-3-small-sfx: ONNX (browser / WebGPU)

The SFX-specific ONNX graphs for stabilityai/stable-audio-3-small-sfx, for text-to-sound-effect generation in the browser via onnxruntime-web.

This model shares its text encoder (T5Gemma) and autoencoder/decoder with the music model (those weights are bit-identical), so only the SFX-specific parts live here. Reuse the text encoder and decoder from lsb/stable-audio-3-small-music-onnx.

Files

dit_q4.onnx + dit_q4.data     SFX diffusion transformer, int4 MatMulNBits (~320 MB)
number_conditioner.onnx       SFX duration embedder
padding_embedding.json        SFX prompt padding vector (768 floats, see below)

Linear/MatMul weights are int4 MatMulNBits (block_size 32); everything else (Conv, norms, the embedding Gather) stays fp32, so every op has a WebGPU kernel and the graph runs on the webgpu execution provider (with wasm fallback).
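A minimal sketch of the session options this implies, assuming onnxruntime-web is imported as `ort` and that listing `wasm` after `webgpu` is how the fallback is requested. The helper name `sfxSessionOptions` and the `SessionOptionsSketch` interface are hypothetical and not part of this repository.

```typescript
// Hypothetical shape: only the field used here, not the full onnxruntime-web options type.
interface SessionOptionsSketch {
  executionProviders: string[];
}

// Prefer WebGPU and keep wasm as the fallback; pass false to force wasm only.
function sfxSessionOptions(preferWebGpu: boolean = true): SessionOptionsSketch {
  return { executionProviders: preferWebGpu ? ["webgpu", "wasm"] : ["wasm"] };
}

// Usage (assumes `import * as ort from "onnxruntime-web"` and a reachable model URL):
//   const session = await ort.InferenceSession.create(
//     "number_conditioner.onnx",
//     sfxSessionOptions(),
//   );
```

The same options object would presumably be reused for every graph in the pipeline so that all sessions land on the same backend.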

Pipeline (identical to the music model)

tokens → text_encoder(lsb) → [override pad rows with padding_embedding]
seconds → number_conditioner → duration token
cross = [text(256); duration(1)] (1,257,768);  global = duration (1,768)
pingpong sampler, 8 steps, LogSNRShift(rate=0, anchor_logsnr=-6.2, logsnr_end=2.0), CFG off
latents → decoder(lsb) → stereo 44.1 kHz
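The conditioning step in the pipeline above can be sketched as a plain buffer concatenation. This assumes the text encoder output and the duration token are already available as row-major `Float32Array`s; the names `buildConditioning`, `Conditioning`, `COND_TEXT_TOKENS`, and `COND_EMBED_DIM` are hypothetical.

```typescript
const COND_TEXT_TOKENS = 256; // text rows in cross_attn_cond
const COND_EMBED_DIM = 768;   // embedding width

interface Conditioning {
  cross: Float32Array;  // (1, 257, 768), row-major: 256 text rows, then the duration row
  global: Float32Array; // (1, 768): the duration token on its own
}

function buildConditioning(text: Float32Array, duration: Float32Array): Conditioning {
  if (text.length !== COND_TEXT_TOKENS * COND_EMBED_DIM) {
    throw new Error(`text must hold ${COND_TEXT_TOKENS * COND_EMBED_DIM} floats`);
  }
  if (duration.length !== COND_EMBED_DIM) {
    throw new Error(`duration must hold ${COND_EMBED_DIM} floats`);
  }
  const cross = new Float32Array((COND_TEXT_TOKENS + 1) * COND_EMBED_DIM);
  cross.set(text, 0);                                    // rows 0..255
  cross.set(duration, COND_TEXT_TOKENS * COND_EMBED_DIM); // row 256
  return { cross, global: duration.slice() };            // copy so the two buffers are independent
}
```

The two buffers would then be wrapped as float32 tensors of shape (1, 257, 768) and (1, 768) before being fed to the DiT.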

DiT I/O

x               float32 (1, 256, t_lat)
t               float32 (1,)
cross_attn_cond float32 (1, 257, 768)
global_embed    float32 (1, 768)
local_add_cond  float32 (1, 257, t_lat)   zeros for plain text-to-audio
padding_mask    bool    (1, t_lat)        all-true
→ out           float32 (1, 256, t_lat)

t_lat = ceil((seconds + 6) · 44100 / 8192) · 2.
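As a worked example of the formula, a 10-second request gives ceil(16 · 44100 / 8192) · 2 = 87 · 2 = 174 latent frames. The sketch below computes t_lat and allocates the two auxiliary inputs whose contents the table specifies (zeros and all-true). The names `latentLength`, `allocateDitAuxInputs`, and `DitAuxInputs` are hypothetical.

```typescript
const DIT_LATENT_CHANNELS = 256;     // channel axis of x and out
const DIT_LOCAL_COND_CHANNELS = 257; // channel axis of local_add_cond, as listed above

// t_lat = ceil((seconds + 6) * 44100 / 8192) * 2
function latentLength(seconds: number): number {
  return Math.ceil(((seconds + 6) * 44100) / 8192) * 2;
}

interface DitAuxInputs {
  tLat: number;
  xLength: number;            // element count of x, shape (1, 256, tLat)
  localAddCond: Float32Array; // (1, 257, tLat), all zeros for plain text-to-audio
  paddingMask: Uint8Array;    // (1, tLat), 1 = true; onnxruntime-web bool tensors use Uint8Array data
}

function allocateDitAuxInputs(seconds: number): DitAuxInputs {
  const tLat = latentLength(seconds);
  return {
    tLat,
    xLength: DIT_LATENT_CHANNELS * tLat,
    localAddCond: new Float32Array(DIT_LOCAL_COND_CHANNELS * tLat), // zero-initialised
    paddingMask: new Uint8Array(tLat).fill(1),
  };
}
```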

padding_embedding

The prompt conditioner replaces padded token positions (where attention_mask == 0) with a learned vector. That vector differs between the SFX and music models, so after running the (shared) text encoder, overwrite the padded rows of last_hidden_state with this 768-vector.
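The override described above can be sketched as an in-place row replacement. This assumes `last_hidden_state` is available as a row-major `Float32Array` of shape (seq_len, 768) and that padding_embedding.json has been parsed into a flat array of 768 numbers; the exact JSON layout is not confirmed here, and the function name `applyPaddingEmbedding` is hypothetical.

```typescript
// Overwrites every padded row of `hidden` (attention mask entry == 0) with `padding`.
// `attentionMask` may hold numbers or bigints (int64 tensors surface as bigint in JavaScript).
function applyPaddingEmbedding(
  hidden: Float32Array,
  attentionMask: ArrayLike<number | bigint>,
  padding: ArrayLike<number>,
): Float32Array {
  const dim = padding.length;
  const seqLen = attentionMask.length;
  if (hidden.length !== seqLen * dim) {
    throw new Error(`hidden has ${hidden.length} values, expected ${seqLen * dim}`);
  }
  for (let row = 0; row < seqLen; row++) {
    if (Number(attentionMask[row]) === 0) {
      hidden.set(padding, row * dim); // replace the whole row with the learned vector
    }
  }
  return hidden; // same buffer, modified in place
}
```

For this model the call would use a 256-row hidden state and the 768-float SFX vector, applied before the duration token is appended.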

License

Inherits the Stability AI Community License; the T5Gemma text encoder additionally falls under Google's Gemma Terms of Use.