---
license: other
license_name: stability-ai-community-license
license_link: https://huggingface.co/stabilityai/stable-audio-3-small-music/blob/main/LICENSE.md
tags:
- audio
- audio-to-audio
- stable-audio
- onnx
- q4
- matmulnbits
- onnxruntime-web
library_name: onnxruntime
base_model: stabilityai/stable-audio-3-small-music
---
# stable-audio-3-small-music — ONNX audio encoder (int4)
The **encoder** half of the [`stabilityai/stable-audio-3-small-music`](https://huggingface.co/stabilityai/stable-audio-3-small-music)
autoencoder, exported to ONNX so audio can be turned into latents **in the browser**
via [`onnxruntime-web`](https://www.npmjs.com/package/onnxruntime-web).
It is the missing counterpart to the decoder in
[`lsb/stable-audio-3-small-music-onnx`](https://huggingface.co/lsb/stable-audio-3-small-music-onnx):
the decoder maps latents → audio, this maps **audio → latents**. Together they enable
audio-to-audio (variation), inpainting, and continuation on top of that text-to-music bundle.
The encode path is `patchify (patch_size=256) → SAME encoder (taae_v2) → softnorm bottleneck`.
Linear/MatMul weights are quantized to **int4 MatMulNBits** (`block_size=32`, symmetric);
convolutions and norms stay fp32. Single file, no external data.
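To make the quantization scheme concrete, here is a minimal sketch of symmetric block-wise 4-bit quantization with `block_size=32`. It is an illustration only, not the exporter used for this file: the function names are hypothetical, the scale convention (`max|w| / 7`) is an assumption, and the real MatMulNBits operator packs two 4-bit values per byte in its own storage layout.

```js
const BLOCK_SIZE = 32; // matches the block_size stated above

// Quantize a flat Float32Array of weights block by block.
// Each block gets one scale; values are stored as signed integers in [-8, 7].
// (Hypothetical helper; kept unpacked in an Int8Array for readability.)
function quantizeBlockwiseInt4(weights) {
  const nBlocks = Math.ceil(weights.length / BLOCK_SIZE);
  const q = new Int8Array(weights.length);
  const scales = new Float32Array(nBlocks);
  for (let b = 0; b < nBlocks; b++) {
    const start = b * BLOCK_SIZE;
    const end = Math.min(start + BLOCK_SIZE, weights.length);
    let maxAbs = 0;
    for (let i = start; i < end; i++) maxAbs = Math.max(maxAbs, Math.abs(weights[i]));
    // Assumed convention: map the largest magnitude in the block to +/-7.
    scales[b] = maxAbs > 0 ? maxAbs / 7 : 1;
    const scale = scales[b];
    for (let i = start; i < end; i++) {
      q[i] = Math.max(-8, Math.min(7, Math.round(weights[i] / scale)));
    }
  }
  return { q, scales };
}

// Reconstruct approximate fp32 weights from the quantized form.
function dequantizeBlockwiseInt4({ q, scales }) {
  const out = new Float32Array(q.length);
  for (let i = 0; i < q.length; i++) {
    out[i] = q[i] * scales[Math.floor(i / BLOCK_SIZE)];
  }
  return out;
}
```

Under these assumptions each quantized weight costs roughly 5 bits (4 bits plus one fp32 scale shared by 32 weights), and the per-weight reconstruction error is bounded by half the block's scale.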
## I/O
```
input   audio    float32  (1, 2, N)         stereo, 44.1 kHz, N a multiple of 8192
output  latents  float32  (1, 256, N/4096)  same latent space as the decoder's input
```
`N` **must be a multiple of 8192 samples** (the model's `audio_align`), which keeps the latent
length `t_lat = N/4096` even. Pad shorter clips with trailing zeros; latent frames are ordered
in time, so you can trim the trailing frames that correspond to the padding.
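The padding and trimming described above could look like the following sketch. The helper names (`padStereo`, `trimLatents`) are hypothetical, and it assumes the latent tensor is stored row-major as `(1, 256, T)` with time as the last axis, as the I/O table suggests.

```js
const AUDIO_ALIGN = 8192;        // N must be a multiple of this
const SAMPLES_PER_LATENT = 4096; // one latent frame per 4096 samples

// Zero-pad a stereo clip to the next multiple of AUDIO_ALIGN and pack it
// channel-major ([all left samples, then all right samples]) for a (1, 2, N) tensor.
function padStereo(left, right) {
  if (left.length !== right.length) throw new Error("channel lengths differ");
  const origN = left.length;
  const N = Math.max(AUDIO_ALIGN, Math.ceil(origN / AUDIO_ALIGN) * AUDIO_ALIGN);
  const audio = new Float32Array(2 * N); // zero-initialised, so the padding is silence
  audio.set(left, 0);
  audio.set(right, N);
  // Number of latent frames that overlap real (non-padding) audio.
  const validFrames = Math.ceil(origN / SAMPLES_PER_LATENT);
  return { audio, N, validFrames };
}

// Keep only the first `keep` time frames of a (1, channels, T) latent buffer.
function trimLatents(data, channels, T, keep) {
  const out = new Float32Array(channels * keep);
  for (let c = 0; c < channels; c++) {
    out.set(data.subarray(c * T, c * T + keep), c * keep);
  }
  return out;
}
```

Whether a downstream consumer also needs an even number of latent frames after trimming is not stated here, so check that before trimming to an odd length.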
## Browser usage (sketch)
```js
import * as ort from "onnxruntime-web/wasm";
ort.env.wasm.numThreads = 1; ort.env.wasm.simd = true;
const base = "https://huggingface.co/bgkb/onnx-encoder/resolve/main";
const buf = new Uint8Array(await (await fetch(`${base}/encoder_q4.onnx`)).arrayBuffer());
const sess = await ort.InferenceSession.create(buf, { executionProviders: ["wasm"] });
// audio: Float32Array in channel-major (planar) order [L(0..N-1), R(0..N-1)], N % 8192 === 0
const latents = (await sess.run({ audio: new ort.Tensor("float32", audio, [1, 2, N]) })).latents;
// → feed latents to the diffusion sampler (variation) or straight to the decoder (round-trip)
```
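If your audio source yields sample-interleaved PCM (`[L0, R0, L1, R1, ...]`, as many WAV decoders do), it has to be rearranged into the `[L(0..N-1), R(0..N-1)]` layout shown in the snippet above before building the tensor. A minimal sketch, with a hypothetical helper name:

```js
// Convert sample-interleaved stereo [L0, R0, L1, R1, ...] into
// channel-major order [L0..L(N-1), R0..R(N-1)].
function deinterleaveStereo(interleaved) {
  if (interleaved.length % 2 !== 0) {
    throw new Error("expected an even number of samples for stereo input");
  }
  const N = interleaved.length / 2;
  const planar = new Float32Array(2 * N);
  for (let i = 0; i < N; i++) {
    planar[i] = interleaved[2 * i];         // left channel
    planar[N + i] = interleaved[2 * i + 1]; // right channel
  }
  return planar;
}
```

Audio decoded with the Web Audio API is already per-channel (`AudioBuffer.getChannelData`), so in that case the two channel arrays only need to be concatenated.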
## Quality
Measured as the full round-trip **ONNX encoder → the q4 browser decoder** on a real clip:
| precision | size | reconstruction SNR |
|-----------|------|--------------------|
| fp32 | 215 MB | 10.2 dB |
| **int4 (this file)** | **36 MB** | **8.2 dB** |

The int4 export is identical to the fp32 graph apart from the quantized weights
(latent correlation 0.978 vs fp32). For variation (audio-to-audio, the analogue of img2img),
the encoder latent is re-noised and refined by the diffusion sampler, so the int4 precision
loss is not audible in practice.
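This card does not spell out how the metrics were computed. The sketch below uses the common definitions (signal power over error power in dB, and Pearson correlation) with hypothetical helper names; it may differ from the script that produced the numbers in the table.

```js
// Signal-to-noise ratio in dB between a reference signal and its reconstruction.
function snrDb(reference, estimate) {
  if (reference.length !== estimate.length) throw new Error("length mismatch");
  let signal = 0;
  let noise = 0;
  for (let i = 0; i < reference.length; i++) {
    const err = reference[i] - estimate[i];
    signal += reference[i] * reference[i];
    noise += err * err;
  }
  return 10 * Math.log10(signal / noise);
}

// Pearson correlation between two equally sized arrays (e.g. fp32 vs int4 latents).
function pearson(a, b) {
  if (a.length !== b.length) throw new Error("length mismatch");
  const n = a.length;
  let meanA = 0;
  let meanB = 0;
  for (let i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
  meanA /= n;
  meanB /= n;
  let cov = 0, varA = 0, varB = 0;
  for (let i = 0; i < n; i++) {
    const da = a[i] - meanA;
    const db = b[i] - meanB;
    cov += da * db;
    varA += da * da;
    varB += db * db;
  }
  return cov / Math.sqrt(varA * varB);
}
```

For a round-trip check, `reference` would be the input waveform and `estimate` the decoder output, flattened the same way and trimmed to the same length.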
## License
Inherits the **Stability AI Community License** from the upstream weights; the bundled
T5Gemma text encoder (used elsewhere in the pipeline) additionally falls under Google's
Gemma Terms of Use. This repo contains only the audio autoencoder's encoder.