bgkb
/

encoder-onnx

onnxruntime-web

Model card Files Files and versions

encoder-onnx / sfx /README.md

bgkb's picture

Add SFX README.md

51015c9 verified 4 days ago

|

history blame contribute delete

2.7 kB

	---
	license: other
	license_name: stability-ai-community-license
	license_link: https://huggingface.co/stabilityai/stable-audio-3-small-sfx/blob/main/LICENSE.md
	tags:
	- audio
	- text-to-audio
	- sound-effects
	- stable-audio
	- onnx
	- q4
	- matmulnbits
	- onnxruntime-web
	- webgpu
	library_name: onnxruntime
	base_model: stabilityai/stable-audio-3-small-sfx
	---

	# stable-audio-3-small-sfx — ONNX (browser / WebGPU)

	The SFX-specific ONNX graphs for [`stabilityai/stable-audio-3-small-sfx`](https://huggingface.co/stabilityai/stable-audio-3-small-sfx),
	for text-to-sound-effect generation in the browser via [`onnxruntime-web`](https://www.npmjs.com/package/onnxruntime-web).

	This model shares its text encoder (T5Gemma) and autoencoder/decoder with the music
	model — those weights are bit-identical — so only the SFX-specific parts live here. Reuse the
	text encoder and decoder from
	[`lsb/stable-audio-3-small-music-onnx`](https://huggingface.co/lsb/stable-audio-3-small-music-onnx).

	## Files

	```
	dit_q4.onnx + dit_q4.data SFX diffusion transformer, int4 MatMulNBits (~320 MB)
	number_conditioner.onnx SFX duration embedder
	padding_embedding.json SFX prompt padding vector (768 floats, see below)
	```

	Linear/MatMul weights are int4 `MatMulNBits` (block_size 32); everything else (Conv, norms,
	the embedding `Gather`) stays fp32 — so every op has a WebGPU kernel and the graph runs on the
	`webgpu` execution provider (with `wasm` fallback).

	## Pipeline (identical to the music model)

	```
	tokens → text_encoder(lsb) → [override pad rows with padding_embedding]
	seconds → number_conditioner → duration token
	cross = [text(256); duration(1)] (1,257,768); global = duration (1,768)
	pingpong sampler, 8 steps, LogSNRShift(rate=0, anchor_logsnr=-6.2, logsnr_end=2.0), CFG off
	latents → decoder(lsb) → stereo 44.1 kHz
	```

	### DiT I/O

	```
	x float32 (1, 256, t_lat)
	t float32 (1,)
	cross_attn_cond float32 (1, 257, 768)
	global_embed float32 (1, 768)
	local_add_cond float32 (1, 257, t_lat) zeros for plain text-to-audio
	padding_mask bool (1, t_lat) all-true
	→ out float32 (1, 256, t_lat)
	```

	`t_lat = ceil((seconds + 6) · 44100 / 8192) · 2`.

	### padding_embedding

	The prompt conditioner replaces padded token positions (where `attention_mask == 0`) with a
	learned vector. That vector differs between the SFX and music models, so after running the
	(shared) text encoder, overwrite the padded rows of `last_hidden_state` with this 768-vector.

	## License

	Inherits the Stability AI Community License; the T5Gemma text encoder additionally falls
	under Google's Gemma Terms of Use.