bgkb committed
Commit c454bd8 · verified · 1 parent: 12ada1c

Document encoder I/O, alignment, quality

Files changed (1)
  1. README.md +75 -0
README.md ADDED
@@ -0,0 +1,75 @@
---
license: other
license_name: stability-ai-community-license
license_link: https://huggingface.co/stabilityai/stable-audio-3-small-music/blob/main/LICENSE.md
tags:
- audio
- audio-to-audio
- stable-audio
- onnx
- q4
- matmulnbits
- onnxruntime-web
library_name: onnxruntime
base_model: stabilityai/stable-audio-3-small-music
---

# stable-audio-3-small-music: ONNX audio encoder (int4)

The **encoder** half of the [`stabilityai/stable-audio-3-small-music`](https://huggingface.co/stabilityai/stable-audio-3-small-music)
autoencoder, exported to ONNX so that audio can be encoded into latents **in the browser**
via [`onnxruntime-web`](https://www.npmjs.com/package/onnxruntime-web).

It is the missing counterpart to the decoder in
[`lsb/stable-audio-3-small-music-onnx`](https://huggingface.co/lsb/stable-audio-3-small-music-onnx):
the decoder maps **latents → audio**, and this model maps **audio → latents**. Together they enable
audio-to-audio (variation), inpainting, and continuation on top of that text-to-music bundle.

The encode path is `patchify (patch_size=256) → SAME encoder (taae_v2) → softnorm bottleneck`.
Linear/MatMul weights are quantized to **int4 MatMulNBits** (`block_size=32`, symmetric);
convolutions and norms stay fp32. The model ships as a single file with no external data.
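
To make "symmetric int4 with `block_size=32`" concrete, here is a minimal sketch of block-wise symmetric 4-bit quantization. It only illustrates the arithmetic. The function names, the signed code range `[-8, 7]`, and the choice of mapping the largest magnitude to code 7 are assumptions for this example, and the actual MatMulNBits operator stores packed codes in its own layout.

```js
// Illustrative only: symmetric 4-bit quantization with one scale per block.
// Assumptions (not taken from the export script): signed codes in [-8, 7],
// scale chosen so the largest magnitude in a block maps to code 7.
const BLOCK_SIZE = 32;

function quantizeBlocks(weights, blockSize = BLOCK_SIZE) {
  const numBlocks = Math.ceil(weights.length / blockSize);
  const codes = new Int8Array(weights.length);
  const scales = new Float32Array(numBlocks);
  for (let b = 0; b < numBlocks; b++) {
    const start = b * blockSize;
    const end = Math.min(start + blockSize, weights.length);
    let maxAbs = 0;
    for (let i = start; i < end; i++) {
      maxAbs = Math.max(maxAbs, Math.abs(weights[i]));
    }
    // Symmetric means there is no zero point: code 0 always decodes to 0.
    scales[b] = maxAbs > 0 ? maxAbs / 7 : 1;
    for (let i = start; i < end; i++) {
      const q = Math.round(weights[i] / scales[b]);
      codes[i] = Math.max(-8, Math.min(7, q));
    }
  }
  return { codes, scales };
}

function dequantizeBlocks({ codes, scales }, blockSize = BLOCK_SIZE) {
  const out = new Float32Array(codes.length);
  for (let i = 0; i < codes.length; i++) {
    out[i] = codes[i] * scales[Math.floor(i / blockSize)];
  }
  return out;
}
```

Each weight is reconstructed to within half a quantization step of its block, which is why small blocks (32 values sharing one scale) limit the error introduced by outliers.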

## I/O

```
input   audio    float32  (1, 2, N)         stereo, 44.1 kHz, N a multiple of 8192
output  latents  float32  (1, 256, N/4096)  same latent space as the decoder's input
```

`N` **must be a multiple of 8192 samples** (the model's `audio_align`), which keeps the latent
length `t_lat = N/4096` even. Pad clips that fall short of a multiple with trailing zeros; the
latent is laid out in time, so you can trim the trailing latent frames that correspond to the padding.
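
The padding rule can be sketched as a few small helpers. The function names are hypothetical; the constants 8192 and 4096 come from the I/O description above, and keeping a latent frame that only partly covers real audio is an assumption of this sketch.

```js
const AUDIO_ALIGN = 8192;        // N must be a multiple of this
const SAMPLES_PER_LATENT = 4096; // one latent frame per 4096 input samples

// Smallest multiple of AUDIO_ALIGN that is >= n (at least one full block).
function alignedLength(n) {
  return Math.max(1, Math.ceil(n / AUDIO_ALIGN)) * AUDIO_ALIGN;
}

// Build the planar [L(0..N-1), R(0..N-1)] buffer, zero-padded at the end.
function padPlanarStereo(left, right) {
  if (left.length !== right.length) {
    throw new Error("channel length mismatch");
  }
  const n = alignedLength(left.length);
  const audio = new Float32Array(2 * n); // typed arrays start zero-filled
  audio.set(left, 0);
  audio.set(right, n);
  return { audio, n };
}

// Number of latent frames that overlap real (non-padding) samples.
// Assumption: a frame that is only partly padding is kept.
function validLatentFrames(originalSamples) {
  return Math.ceil(originalSamples / SAMPLES_PER_LATENT);
}
```

For a one-second clip (44100 samples), `alignedLength` gives 49152, so the encoder produces 12 latent frames, of which `validLatentFrames(44100)` keeps 11.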

## Browser usage (sketch)

```js
import * as ort from "onnxruntime-web/wasm";

// Single-threaded WASM with SIMD enabled.
ort.env.wasm.numThreads = 1;
ort.env.wasm.simd = true;

const base = "https://huggingface.co/bgkb/onnx-encoder/resolve/main";
const buf = new Uint8Array(await (await fetch(`${base}/encoder_q4.onnx`)).arrayBuffer());
const sess = await ort.InferenceSession.create(buf, { executionProviders: ["wasm"] });

// audio: Float32Array of length 2 * N in planar (channel-major) order,
// [L(0..N-1), R(0..N-1)], with N % 8192 === 0
const { latents } = await sess.run({ audio: new ort.Tensor("float32", audio, [1, 2, N]) });
// → feed latents to the diffusion sampler (variation) or straight to the decoder (round-trip)
```
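
If the input was zero-padded, the trailing latent frames can be dropped after the run. A minimal sketch, assuming the output buffer is a row-major `(1, channels, frames)` tensor, so each channel is a contiguous run of `frames` values; `trimLatentFrames` is a hypothetical helper, not part of `onnxruntime-web`.

```js
// Keep the first `keep` frames of a flat (1, channels, frames) buffer.
function trimLatentFrames(data, channels, frames, keep) {
  if (data.length !== channels * frames) {
    throw new Error("buffer does not match channels * frames");
  }
  const k = Math.min(keep, frames);
  const out = new Float32Array(channels * k);
  for (let c = 0; c < channels; c++) {
    // Copy the first k values of channel c into its new, shorter row.
    out.set(data.subarray(c * frames, c * frames + k), c * k);
  }
  return out;
}
```

With the session output above, this would be called as `trimLatentFrames(latents.data, 256, latents.dims[2], keep)`. Whether the decoder accepts an odd number of frames is not stated here, so rounding `keep` up to an even value may be the safer choice.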

## Quality

Measured as the full round trip **ONNX encoder → q4 browser decoder** on a real clip:

| precision | size | reconstruction SNR |
|-----------|------|--------------------|
| fp32 | 215 MB | 10.2 dB |
| **int4 (this file)** | **36 MB** | **8.2 dB** |

The int4 export uses the same graph as the fp32 export and differs only in the quantized weights
(latent correlation 0.978 vs fp32). For audio-to-audio variation (the analogue of img2img), the
encoder latent is re-noised and refined by the diffusion sampler, so the int4 precision loss is
not audible in practice.
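
For readers who want to run a similar check, here is a minimal sketch of the two metrics. The exact procedure behind the numbers above is not documented here, so both definitions are assumptions: SNR as signal energy over error energy in dB, and correlation as the Pearson coefficient over flattened latents.

```js
// SNR in dB between a reference signal and its reconstruction.
function snrDb(reference, estimate) {
  let signal = 0;
  let noise = 0;
  for (let i = 0; i < reference.length; i++) {
    const err = reference[i] - estimate[i];
    signal += reference[i] * reference[i];
    noise += err * err;
  }
  return 10 * Math.log10(signal / noise);
}

// Pearson correlation between two equally long flat arrays.
function pearson(a, b) {
  const n = a.length;
  let meanA = 0;
  let meanB = 0;
  for (let i = 0; i < n; i++) {
    meanA += a[i] / n;
    meanB += b[i] / n;
  }
  let cov = 0;
  let varA = 0;
  let varB = 0;
  for (let i = 0; i < n; i++) {
    const da = a[i] - meanA;
    const db = b[i] - meanB;
    cov += da * db;
    varA += da * da;
    varB += db * db;
  }
  return cov / Math.sqrt(varA * varB);
}
```

Under these definitions, `snrDb(original, roundTrip)` would be applied to the waveform samples and `pearson(fp32Latents, int4Latents)` to the two encoders' outputs for the same clip.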

## License

Inherits the **Stability AI Community License** from the upstream weights. The T5Gemma text
encoder used elsewhere in the pipeline additionally falls under Google's Gemma Terms of Use,
but it is not part of this repo, which contains only the audio autoencoder's encoder.