Commit 51015c9 by bgkb (verified) · 1 parent: 8596776

Add SFX README.md

Files changed (1)
  1. sfx/README.md +74 -0
sfx/README.md ADDED
@@ -0,0 +1,74 @@
+ ---
+ license: other
+ license_name: stability-ai-community-license
+ license_link: https://huggingface.co/stabilityai/stable-audio-3-small-sfx/blob/main/LICENSE.md
+ tags:
+ - audio
+ - text-to-audio
+ - sound-effects
+ - stable-audio
+ - onnx
+ - q4
+ - matmulnbits
+ - onnxruntime-web
+ - webgpu
+ library_name: onnxruntime
+ base_model: stabilityai/stable-audio-3-small-sfx
+ ---
+
+ # stable-audio-3-small-sfx — ONNX (browser / WebGPU)
+
+ This repository holds the **SFX-specific** ONNX graphs for [`stabilityai/stable-audio-3-small-sfx`](https://huggingface.co/stabilityai/stable-audio-3-small-sfx),
+ for text-to-sound-effect generation in the browser via [`onnxruntime-web`](https://www.npmjs.com/package/onnxruntime-web).
+
+ This model shares its **text encoder** (T5Gemma) and **autoencoder/decoder** with the music
+ model (those weights are bit-identical), so only the SFX-specific parts live here. Reuse the
+ text encoder and decoder from
+ [`lsb/stable-audio-3-small-music-onnx`](https://huggingface.co/lsb/stable-audio-3-small-music-onnx).
+
+ ## Files
+
+ ```
+ dit_q4.onnx + dit_q4.data   SFX diffusion transformer, int4 MatMulNBits (~320 MB)
+ number_conditioner.onnx     SFX duration embedder
+ padding_embedding.json      SFX prompt padding vector (768 floats, see below)
+ ```
+
+ Linear/MatMul weights are int4 `MatMulNBits` (block_size 32); everything else (Conv, norms,
+ the embedding `Gather`) stays fp32, so every op has a WebGPU kernel and the graph runs on the
+ `webgpu` execution provider (with `wasm` fallback).
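The provider order and the split `.onnx` / `.data` pair can be expressed as session options. This is a minimal sketch, assuming onnxruntime-web's `executionProviders` and `externalData` session options and assuming the graph records its weights under the file name `dit_q4.data`; the helper `ditSessionOptions` and its `baseUrl` parameter are hypothetical names introduced here.

```typescript
// Hypothetical helper (not part of this repo): options for the DiT session.
interface DitSessionOptions {
  executionProviders: string[];
  externalData: { path: string; data: string }[];
}

function ditSessionOptions(baseUrl: string): DitSessionOptions {
  // Normalise the base URL so file names can simply be appended.
  const root = baseUrl.slice(-1) === "/" ? baseUrl : baseUrl + "/";
  return {
    // Try WebGPU first and list WASM as the fallback provider.
    executionProviders: ["webgpu", "wasm"],
    // Assumption: the graph refers to its weights as "dit_q4.data";
    // map that name to a URL the runtime can fetch.
    externalData: [{ path: "dit_q4.data", data: root + "dit_q4.data" }],
  };
}
```

The returned object would be passed as the second argument to `ort.InferenceSession.create`, alongside the URL of `dit_q4.onnx`.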
+
+ ## Pipeline (identical to the music model)
+
+ ```
+ tokens → text_encoder(lsb) → [override pad rows with padding_embedding]
+ seconds → number_conditioner → duration token
+ cross = [text(256); duration(1)] (1,257,768); global = duration (1,768)
+ pingpong sampler, 8 steps, LogSNRShift(rate=0, anchor_logsnr=-6.2, logsnr_end=2.0), CFG off
+ latents → decoder(lsb) → stereo 44.1 kHz
+ ```
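The `cross` / `global` line of the pipeline amounts to a row-wise concatenation. The sketch below illustrates it on flat row-major arrays with the batch dimension omitted; `buildConditioning` and the constant names are hypothetical, and it assumes the text encoder output has already been padded to 256 rows and had its pad rows overridden.

```typescript
const TEXT_TOKENS = 256; // prompt rows, from the pipeline above
const EMBED_DIM = 768;   // embedding width, from the pipeline above

// text: (256, 768) row-major hidden states; duration: the (768,) duration token.
// Returns cross_attn_cond as (257, 768) and global_embed as (768,).
function buildConditioning(
  text: Float32Array,
  duration: Float32Array
): { cross: Float32Array; globalEmbed: Float32Array } {
  if (text.length !== TEXT_TOKENS * EMBED_DIM) throw new Error("text must be 256 x 768");
  if (duration.length !== EMBED_DIM) throw new Error("duration must have 768 values");
  const cross = new Float32Array((TEXT_TOKENS + 1) * EMBED_DIM);
  cross.set(text, 0);                          // rows 0..255: text embeddings
  cross.set(duration, TEXT_TOKENS * EMBED_DIM); // row 256: duration token
  const globalEmbed = new Float32Array(duration); // independent copy
  return { cross, globalEmbed };
}
```

Both results gain a leading batch dimension of 1 when wrapped as tensors, giving the `(1, 257, 768)` and `(1, 768)` shapes listed above.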
50
+
51
+ ### DiT I/O
52
+
53
+ ```
54
+ x float32 (1, 256, t_lat)
55
+ t float32 (1,)
56
+ cross_attn_cond float32 (1, 257, 768)
57
+ global_embed float32 (1, 768)
58
+ local_add_cond float32 (1, 257, t_lat) zeros for plain text-to-audio
59
+ padding_mask bool (1, t_lat) all-true
60
+ → out float32 (1, 256, t_lat)
61
+ ```
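For plain text-to-audio, two of these inputs are constants that depend only on `t_lat`. A minimal sketch of allocating them follows; `RawTensor` and `ditStaticInputs` are hypothetical names, and the use of a `Uint8Array` of ones for the mask assumes the runtime's usual representation of `bool` tensors.

```typescript
// Hypothetical plain description of a tensor: element type, flat data, shape.
interface RawTensor {
  type: "float32" | "bool";
  data: Float32Array | Uint8Array;
  dims: number[];
}

function ditStaticInputs(tLat: number): { local_add_cond: RawTensor; padding_mask: RawTensor } {
  return {
    // Float32Array is zero-initialised, which is what plain text-to-audio needs.
    local_add_cond: { type: "float32", data: new Float32Array(257 * tLat), dims: [1, 257, tLat] },
    // Assumption: bool tensors are backed by Uint8Array, with 1 meaning true.
    padding_mask: { type: "bool", data: new Uint8Array(tLat).fill(1), dims: [1, tLat] },
  };
}
```

Each entry maps onto `new ort.Tensor(type, data, dims)` when building the feeds for the DiT session.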
+
+ `t_lat = ceil((seconds + 6) · 44100 / 8192) · 2`.
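The formula translates directly into a small helper. The function and constant names below are hypothetical labels for the numbers in the formula, not identifiers from the model code.

```typescript
const SAMPLE_RATE = 44100;    // Hz, from the formula above
const SAMPLES_DIVISOR = 8192; // divisor in the formula above
const EXTRA_SECONDS = 6;      // constant added to the requested duration

// Latent length t_lat for a requested duration in seconds.
function latentLength(seconds: number): number {
  const chunks = Math.ceil(((seconds + EXTRA_SECONDS) * SAMPLE_RATE) / SAMPLES_DIVISOR);
  return chunks * 2;
}
```

For a 10-second request, `latentLength(10)` evaluates to 174, since `ceil(16 · 44100 / 8192) = 87`.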
+
+ ### padding_embedding
+
+ The prompt conditioner replaces padded token positions (where `attention_mask == 0`) with a
+ learned vector. That vector differs between the SFX and music models, so after running the
+ (shared) text encoder, overwrite the padded rows of `last_hidden_state` with this 768-element vector.
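The override step can be sketched as an in-place row replacement. This assumes `padding_embedding.json` parses to a flat array of 768 numbers and that the attention mask has been converted to plain numbers (not `BigInt`); `overridePaddingRows` is a hypothetical helper name.

```typescript
// hidden: (seqLen, dim) row-major last_hidden_state, batch dim omitted; modified in place.
// attentionMask: seqLen entries, where 0 marks a padded position.
// padding: the dim-length vector loaded from padding_embedding.json.
function overridePaddingRows(
  hidden: Float32Array,
  attentionMask: ArrayLike<number>,
  padding: ArrayLike<number>
): Float32Array {
  const dim = padding.length;
  const seqLen = attentionMask.length;
  if (hidden.length !== seqLen * dim) throw new Error("hidden must be seqLen x dim");
  for (let i = 0; i < seqLen; i++) {
    // Overwrite only rows whose mask entry is 0; real tokens are left untouched.
    if (attentionMask[i] === 0) hidden.set(padding, i * dim);
  }
  return hidden;
}
```

With the real model, `hidden` has 256 rows of 768 values, and the result feeds the text part of `cross_attn_cond`.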
+
+ ## License
+
+ Inherits the **Stability AI Community License**; the T5Gemma text encoder additionally falls
+ under Google's Gemma Terms of Use.