bgkb/encoder-onnx

+---
+license: other
+license_name: stability-ai-community-license
+license_link: https://huggingface.co/stabilityai/stable-audio-3-small-music/blob/main/LICENSE.md
+tags:
+  - audio
+  - audio-to-audio
+  - stable-audio
+  - onnx
+  - q4
+  - matmulnbits
+  - onnxruntime-web
+library_name: onnxruntime
+base_model: stabilityai/stable-audio-3-small-music
+---
+# stable-audio-3-small-music — ONNX audio encoder (int4)
+The **encoder** half of the [`stabilityai/stable-audio-3-small-music`](https://huggingface.co/stabilityai/stable-audio-3-small-music)
+autoencoder, exported to ONNX so audio can be turned into latents **in the browser**
+via [`onnxruntime-web`](https://www.npmjs.com/package/onnxruntime-web).
+It is the missing counterpart to the decoder in
+[`lsb/stable-audio-3-small-music-onnx`](https://huggingface.co/lsb/stable-audio-3-small-music-onnx):
+the decoder maps latents → audio, this maps **audio → latents**. Together they enable
+audio-to-audio (variation), inpainting, and continuation on top of that text-to-music bundle.
+The encode path is `patchify (patch_size=256) → SAME encoder (taae_v2) → softnorm bottleneck`.
+Linear/MatMul weights are quantized to **int4 MatMulNBits** (`block_size=32`, symmetric);
+convolutions and norms stay fp32. Single file, no external data.
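
To make "symmetric, `block_size=32`" concrete, here is a minimal sketch of symmetric block-wise 4-bit quantization. It is illustrative only: `quantizeBlock` and `dequantizeBlock` are hypothetical names, the `maxAbs / 7` scale rule is an assumption, and the exact rounding and nibble packing used by the MatMulNBits exporter may differ.

```javascript
// Illustrative sketch, not the exporter's code.
const BLOCK_SIZE = 32; // weights sharing one scale, per the card's block_size

// Quantize one block of up to BLOCK_SIZE weights to signed 4-bit integers.
function quantizeBlock(weights) {
  let maxAbs = 0;
  for (const w of weights) maxAbs = Math.max(maxAbs, Math.abs(w));
  // Assumption: map the largest magnitude to +/-7 so zero stays exactly zero.
  const scale = maxAbs > 0 ? maxAbs / 7 : 1;
  const q = new Int8Array(weights.length);
  for (let i = 0; i < weights.length; i++) {
    // Clamp to the signed 4-bit range [-8, 7].
    q[i] = Math.max(-8, Math.min(7, Math.round(weights[i] / scale)));
  }
  return { q, scale };
}

// Reconstruct approximate fp32 weights from the quantized block.
function dequantizeBlock({ q, scale }) {
  return Float32Array.from(q, (v) => v * scale);
}
```

Under this scheme each reconstructed weight is off by at most half a quantization step (`scale / 2`), which is the kind of error the reconstruction SNR in the Quality section reflects.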
+## I/O
+```
+input   audio    float32  (1, 2, N)        stereo, 44.1 kHz, N a multiple of 8192
+output  latents  float32  (1, 256, N/4096) same latent space as the decoder's input
+```
+`N` **must be a multiple of 8192 samples** (the model's `audio_align`), which keeps the latent
+length `t_lat = N/4096` even. Zero-pad any clip whose length is not already a multiple of 8192;
+latent frames are ordered in time, so you can trim the trailing frames that correspond to the padding.
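
The padding and trimming described above could be sketched as follows. `padStereo` and `trimLatents` are hypothetical helpers, not part of this repo; the constants 8192 and 4096 come from the I/O table, and the latent is assumed to be a flat row-major `(1, C, T)` array as returned by onnxruntime-web.

```javascript
const AUDIO_ALIGN = 8192;        // required multiple for N (from the I/O section)
const SAMPLES_PER_LATENT = 4096; // one latent frame per 4096 samples

// Zero-pad two equal-length channels to the next multiple of `align`.
function padStereo(left, right, align = AUDIO_ALIGN) {
  if (left.length !== right.length) {
    throw new Error("left and right channels must have the same length");
  }
  const len = left.length;
  // Round up to the next multiple of `align` (at least one full block).
  const N = Math.max(1, Math.ceil(len / align)) * align;
  // All left samples, then all right samples; untouched entries stay zero.
  const audio = new Float32Array(2 * N);
  audio.set(left, 0);
  audio.set(right, N);
  // Number of latent frames that overlap real (unpadded) audio.
  const validFrames = Math.ceil(len / SAMPLES_PER_LATENT);
  return { audio, N, validFrames };
}

// Keep only the first `validFrames` time steps of a flat (1, C, T) latent.
function trimLatents(data, channels, T, validFrames) {
  const out = new Float32Array(channels * validFrames);
  for (let c = 0; c < channels; c++) {
    out.set(data.subarray(c * T, c * T + validFrames), c * validFrames);
  }
  return out;
}
```

For example, a 10,000-sample clip pads to `N = 16384` (4 latent frames), of which 3 overlap real audio. Trimming can leave an odd frame count; this card does not say whether downstream stages accept that, so check before feeding trimmed latents onward.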
+## Browser usage (sketch)
+```js
+import * as ort from "onnxruntime-web/wasm";
+ort.env.wasm.numThreads = 1; ort.env.wasm.simd = true;
+const base = "https://huggingface.co/bgkb/encoder-onnx/resolve/main";
+const buf  = new Uint8Array(await (await fetch(`${base}/encoder_q4.onnx`)).arrayBuffer());
+const sess = await ort.InferenceSession.create(buf, { executionProviders: ["wasm"] });
+// audio: Float32Array in planar (channel-major) layout [L(0..N-1), R(0..N-1)], N % 8192 === 0
+const latents = (await sess.run({ audio: new ort.Tensor("float32", audio, [1, 2, N]) })).latents;
+// → feed latents to the diffusion sampler (variation) or straight to the decoder (round-trip)
+```
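
A tensor of shape `[1, 2, N]` expects all left samples followed by all right samples. If your audio arrives as sample-interleaved stereo (`L0, R0, L1, R1, ...`, as raw PCM from WAV files usually is), a small conversion is needed first. The helper below is a hypothetical sketch, not part of this repo.

```javascript
// Convert interleaved stereo [L0, R0, L1, R1, ...] into the
// [L0..L(N-1), R0..R(N-1)] layout used for the (1, 2, N) input tensor.
function interleavedToPlanar(interleaved) {
  if (interleaved.length % 2 !== 0) {
    throw new Error("interleaved stereo must have an even number of samples");
  }
  const N = interleaved.length / 2;
  const planar = new Float32Array(2 * N);
  for (let i = 0; i < N; i++) {
    planar[i] = interleaved[2 * i];         // left channel
    planar[N + i] = interleaved[2 * i + 1]; // right channel
  }
  return planar;
}
```

Audio decoded with the Web Audio API already comes as one array per channel (`AudioBuffer.getChannelData`), so in that case the two channel arrays can simply be copied back to back.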
+## Quality
+Measured as the full round-trip **ONNX encoder → the q4 browser decoder** on a real clip:
+| precision | size | reconstruction SNR |
+|-----------|------|--------------------|
+| fp32      | 215 MB | 10.2 dB |
+| **int4 (this file)** | **36 MB** | **8.2 dB** |
+The int4 export uses the same graph as the fp32 export; only the quantized weights differ
+(latent correlation 0.978 against the fp32 latents). For variation (the audio analogue of img2img),
+the encoder latent is re-noised and refined by the diffusion sampler, so the int4 precision loss
+is not audible in practice.
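
For readers who want to reproduce comparable numbers on their own clips, here is a minimal sketch of the two metrics. The card does not state the exact formulas, so both definitions are assumptions: SNR as signal energy over error energy in dB, and correlation as the Pearson coefficient over flattened arrays. `snrDb` and `pearson` are hypothetical names.

```javascript
// Assumed definition: SNR = 10 * log10( sum(ref^2) / sum((ref - test)^2) ).
function snrDb(ref, test) {
  if (ref.length !== test.length) throw new Error("length mismatch");
  let signal = 0;
  let noise = 0;
  for (let i = 0; i < ref.length; i++) {
    const d = ref[i] - test[i];
    signal += ref[i] * ref[i];
    noise += d * d;
  }
  return 10 * Math.log10(signal / noise);
}

// Pearson correlation between two equally sized flattened arrays.
function pearson(a, b) {
  if (a.length !== b.length) throw new Error("length mismatch");
  const n = a.length;
  let meanA = 0;
  let meanB = 0;
  for (let i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
  meanA /= n;
  meanB /= n;
  let cov = 0;
  let varA = 0;
  let varB = 0;
  for (let i = 0; i < n; i++) {
    const da = a[i] - meanA;
    const db = b[i] - meanB;
    cov += da * db;
    varA += da * da;
    varB += db * db;
  }
  return cov / Math.sqrt(varA * varB);
}
```

Here `ref` would be the original audio (or the fp32 latents) and `test` the round-trip output (or the int4 latents), trimmed to the same length.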
+## License
+Inherits the **Stability AI Community License** from the upstream weights. The T5Gemma text
+encoder used elsewhere in the pipeline additionally falls under Google's Gemma Terms of Use,
+but it is not included here: this repo contains only the audio autoencoder's encoder.