stable-audio-3-small-music: ONNX audio encoder (int4)
The encoder half of the stabilityai/stable-audio-3-small-music
autoencoder, exported to ONNX so audio can be turned into latents in the browser
via onnxruntime-web.
It is the missing counterpart to the decoder in
lsb/stable-audio-3-small-music-onnx:
the decoder maps latents → audio, this maps audio → latents. Together they enable
audio-to-audio (variation), inpainting, and continuation on top of that text-to-music bundle.
The encode path is patchify (patch_size=256) → SAME encoder (taae_v2) → softnorm bottleneck.
Linear/MatMul weights are quantized to int4 MatMulNBits (block_size=32, symmetric);
convolutions and norms stay fp32. Single file, no external data.
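To make "block_size=32, symmetric" concrete, here is a minimal sketch of generic symmetric 4-bit block quantization. It is an illustration only: the function names are hypothetical, the scale rule (max |w| / 7) is an assumption, and it does not reproduce the bit packing or the exact scale convention that onnxruntime's MatMulNBits export uses.

```javascript
// Generic symmetric int4 block quantization (illustrative, not the ORT packing).
const BLOCK_SIZE = 32; // from the model card

// Quantize a flat Float32Array: one scale per block of 32 weights,
// integer codes in the signed 4-bit range [-8, 7].
function quantizeBlocks(weights) {
  const nBlocks = Math.ceil(weights.length / BLOCK_SIZE);
  const q = new Int8Array(weights.length); // one code per weight, unpacked
  const scales = new Float32Array(nBlocks);
  for (let b = 0; b < nBlocks; b++) {
    const start = b * BLOCK_SIZE;
    const end = Math.min(start + BLOCK_SIZE, weights.length);
    let maxAbs = 0;
    for (let i = start; i < end; i++) {
      maxAbs = Math.max(maxAbs, Math.abs(weights[i]));
    }
    // Assumed scale rule: map the largest magnitude in the block to code 7.
    scales[b] = maxAbs > 0 ? maxAbs / 7 : 1;
    const scale = scales[b];
    for (let i = start; i < end; i++) {
      q[i] = Math.max(-8, Math.min(7, Math.round(weights[i] / scale)));
    }
  }
  return { q, scales };
}

// Reconstruct approximate fp32 weights from codes and per-block scales.
function dequantizeBlocks(q, scales) {
  const out = new Float32Array(q.length);
  for (let i = 0; i < q.length; i++) {
    out[i] = q[i] * scales[Math.floor(i / BLOCK_SIZE)];
  }
  return out;
}
```

Under this rule the per-weight error is at most half a scale step, so blocks with a small dynamic range are reproduced more precisely than blocks containing an outlier.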
I/O
| direction | name | dtype | shape | notes |
|---|---|---|---|---|
| input | audio | float32 | (1, 2, N) | stereo, 44.1 kHz, N a multiple of 8192 |
| output | latents | float32 | (1, 256, N/4096) | same latent space as the decoder's input |
N must be a multiple of 8192 samples (the model's audio_align, so the latent length
t_lat = N/4096 is even). Pad shorter clips with zeros; the latent is laid out in time, so
you can trim trailing latent frames that correspond to the padding.
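The padding rule can be sketched as follows. The helper name `padStereo` and its return fields are hypothetical; only the constants 8192 and 4096 come from the text above. Counting a frame as "valid" when it overlaps any real audio is an assumption, and frames near the boundary may still be influenced by the zero padding.

```javascript
const AUDIO_ALIGN = 8192;       // N must be a multiple of this
const SAMPLES_PER_FRAME = 4096; // one latent frame per 4096 input samples

// Zero-pad two equal-length channels to the next multiple of AUDIO_ALIGN and
// pack them in planar order [L(0..N-1), R(0..N-1)] for a (1, 2, N) tensor.
function padStereo(left, right) {
  if (left.length !== right.length) {
    throw new Error("channel length mismatch");
  }
  const n = left.length;
  const N = Math.max(AUDIO_ALIGN, Math.ceil(n / AUDIO_ALIGN) * AUDIO_ALIGN);
  const audio = new Float32Array(2 * N); // starts zero-filled
  audio.set(left, 0);
  audio.set(right, N);
  const latentFrames = N / SAMPLES_PER_FRAME;
  // Frames that overlap at least one real (unpadded) sample.
  const validFrames = Math.min(latentFrames, Math.ceil(n / SAMPLES_PER_FRAME));
  return { audio, N, latentFrames, validFrames };
}
```

In a browser, `left` and `right` could come from `AudioBuffer.getChannelData(0)` and `getChannelData(1)` after decoding at 44.1 kHz. For a 10,000-sample clip this yields N = 16384 and 4 latent frames, of which the first 3 overlap real audio.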
Browser usage (sketch)
import * as ort from "onnxruntime-web/wasm";
ort.env.wasm.numThreads = 1; ort.env.wasm.simd = true;
const base = "https://huggingface.co/bgkb/onnx-encoder/resolve/main";
const buf = new Uint8Array(await (await fetch(`${base}/encoder_q4.onnx`)).arrayBuffer());
const sess = await ort.InferenceSession.create(buf, { executionProviders: ["wasm"] });
// audio: Float32Array in planar (channel-major, not sample-interleaved) order [L(0..N-1), R(0..N-1)], N % 8192 === 0
const latents = (await sess.run({ audio: new ort.Tensor("float32", audio, [1, 2, N]) })).latents;
// → feed latents to the diffusion (variation) or straight to the decoder (round-trip)
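If the input was zero-padded, the trailing latent frames can be dropped before further use. The sketch below assumes the output data is a row-major Float32Array of shape (1, channels, frames); `trimLatentFrames` is a hypothetical helper, not part of onnxruntime-web.

```javascript
// Keep the first `keep` time frames of a row-major (1, channels, frames) buffer.
// Each channel row is contiguous, so the trim is one copy per channel.
function trimLatentFrames(data, channels, frames, keep) {
  if (data.length !== channels * frames) {
    throw new Error("data length does not match channels * frames");
  }
  if (!Number.isInteger(keep) || keep < 0 || keep > frames) {
    throw new RangeError("keep must be an integer in [0, frames]");
  }
  const out = new Float32Array(channels * keep);
  for (let c = 0; c < channels; c++) {
    out.set(data.subarray(c * frames, c * frames + keep), c * keep);
  }
  return out;
}
```

With the session output above, `latents.data` and `latents.dims` (here `[1, 256, N / 4096]`) would supply `data`, `channels`, and `frames`. Since the alignment rule keeps the latent length even, it may be safer to round `keep` up to an even number before handing the result to the decoder; that is an assumption, not something the card states.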
Quality
Measured as the full round trip (ONNX encoder → the q4 browser decoder) on a real clip:
| precision | size | reconstruction SNR |
|---|---|---|
| fp32 | 215 MB | 10.2 dB |
| int4 (this file) | 36 MB | 8.2 dB |
The int4 export is bit-faithful to the fp32 graph except for the quantized weights (latent correlation 0.978 vs fp32). For variation/img2img the encoder latent is re-noised and refined by the diffusion sampler, so the int4 precision loss is not audible in practice.
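The card does not spell out how the two metrics were computed. The sketch below uses the conventional definitions as an assumption: SNR as 10·log10 of signal energy over error energy, and correlation as the Pearson coefficient over flattened arrays. The function names are hypothetical.

```javascript
// Reconstruction SNR in dB: 10 * log10( sum(ref^2) / sum((ref - est)^2) ).
function snrDb(reference, estimate) {
  if (reference.length !== estimate.length) {
    throw new Error("length mismatch");
  }
  let signal = 0;
  let noise = 0;
  for (let i = 0; i < reference.length; i++) {
    const d = reference[i] - estimate[i];
    signal += reference[i] * reference[i];
    noise += d * d;
  }
  return 10 * Math.log10(signal / noise);
}

// Pearson correlation between two flattened arrays (e.g. fp32 vs int4 latents).
function pearson(a, b) {
  if (a.length !== b.length) {
    throw new Error("length mismatch");
  }
  let meanA = 0;
  let meanB = 0;
  for (let i = 0; i < a.length; i++) {
    meanA += a[i];
    meanB += b[i];
  }
  meanA /= a.length;
  meanB /= b.length;
  let cov = 0;
  let varA = 0;
  let varB = 0;
  for (let i = 0; i < a.length; i++) {
    const da = a[i] - meanA;
    const db = b[i] - meanB;
    cov += da * db;
    varA += da * da;
    varB += db * db;
  }
  return cov / Math.sqrt(varA * varB);
}
```

Under this definition, 10.2 dB means the signal energy is roughly 10.5 times the error energy, and 8.2 dB roughly 6.6 times.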
License
Inherits the Stability AI Community License from the upstream weights; the bundled T5Gemma text encoder (used elsewhere in the pipeline) additionally falls under Google's Gemma Terms of Use. This repo contains only the audio autoencoder's encoder.