stable-audio-3-small-music — ONNX audio encoder (int4)

The encoder half of the stabilityai/stable-audio-3-small-music autoencoder, exported to ONNX so audio can be turned into latents in the browser via onnxruntime-web.

It is the missing counterpart to the decoder in lsb/stable-audio-3-small-music-onnx: the decoder maps latents → audio, this maps audio → latents. Together they enable audio-to-audio (variation), inpainting, and continuation on top of that text-to-music bundle.

The encode path is patchify (patch_size=256) → SAME encoder (taae_v2) → softnorm bottleneck. Linear/MatMul weights are quantized to int4 MatMulNBits (block_size=32, symmetric); convolutions and norms stay fp32. Single file, no external data.
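
To make "int4, block_size=32, symmetric" concrete, here is a minimal sketch of symmetric per-block quantization. It is a generic illustration, not the exporter's actual algorithm: the function names, the max-abs/7 scale rule, and round-to-nearest are all assumptions. The byte packing that MatMulNBits uses is not reproduced.

```javascript
const BLOCK_SIZE = 32; // from the text: one scale per 32 weights

// Hypothetical helper: quantize a flat weight array to signed 4-bit codes.
// Assumption: scale = max|w| / 7 per block, round-to-nearest, no zero point.
function quantizeBlocksInt4(weights) {
  if (weights.length % BLOCK_SIZE !== 0) {
    throw new Error("weight count must be a multiple of BLOCK_SIZE");
  }
  const nBlocks = weights.length / BLOCK_SIZE;
  const q = new Int8Array(weights.length);   // one code per weight, unpacked
  const scales = new Float32Array(nBlocks);
  for (let b = 0; b < nBlocks; b++) {
    const start = b * BLOCK_SIZE;
    let maxAbs = 0;
    for (let i = 0; i < BLOCK_SIZE; i++) {
      maxAbs = Math.max(maxAbs, Math.abs(weights[start + i]));
    }
    // Fall back to 1 for an all-zero block to avoid dividing by zero.
    const scale = Math.fround(maxAbs / 7) || 1;
    scales[b] = scale;
    for (let i = 0; i < BLOCK_SIZE; i++) {
      const code = Math.round(weights[start + i] / scale);
      q[start + i] = Math.max(-8, Math.min(7, code)); // clamp to the int4 range
    }
  }
  return { q, scales };
}

// Hypothetical inverse: reconstruct approximate fp32 weights from codes and scales.
function dequantizeBlocksInt4({ q, scales }) {
  const out = new Float32Array(q.length);
  for (let i = 0; i < q.length; i++) {
    out[i] = q[i] * scales[Math.floor(i / BLOCK_SIZE)];
  }
  return out;
}
```

Each block stores 32 four-bit codes plus one scale, which is why the Linear/MatMul weights shrink so much while convolutions and norms, left in fp32, do not.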

I/O

input   audio    float32  (1, 2, N)        stereo, 44.1 kHz, N a multiple of 8192
output  latents  float32  (1, 256, N/4096) same latent space as the decoder's input

N must be a multiple of 8192 samples (the model's audio_align), which keeps the latent length t_lat = N/4096 even. Zero-pad shorter clips up to the next multiple; because the latent is laid out in time, you can trim the trailing latent frames that correspond to the padding.
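
The padding rule can be sketched as a small helper that also produces the planar layout the model input expects. The function name padStereo and the returned field names are hypothetical. validFrames counts every latent frame that overlaps at least one real sample; that is an assumption about where to cut, not something the model prescribes.

```javascript
const AUDIO_ALIGN = 8192;        // from the text: N must be a multiple of this
const SAMPLES_PER_LATENT = 4096; // from the I/O table: t_lat = N / 4096

// Hypothetical helper: pack two equal-length channels into one zero-padded
// planar buffer [L(0..N-1), R(0..N-1)] with N a multiple of AUDIO_ALIGN.
function padStereo(left, right) {
  if (left.length !== right.length) {
    throw new Error("left and right must have the same length");
  }
  const n = left.length;
  // Round up to the next multiple of AUDIO_ALIGN (at least one full block).
  const N = Math.max(AUDIO_ALIGN, Math.ceil(n / AUDIO_ALIGN) * AUDIO_ALIGN);
  const audio = new Float32Array(2 * N); // typed arrays start zero-filled
  audio.set(left, 0);  // channel 0 occupies [0, N)
  audio.set(right, N); // channel 1 occupies [N, 2N)
  return {
    audio,
    N,
    latentFrames: N / SAMPLES_PER_LATENT,
    // Assumption: frames touching any real sample are treated as valid.
    validFrames: Math.ceil(n / SAMPLES_PER_LATENT),
  };
}
```

In a browser, left and right could come from AudioBuffer.getChannelData(0) and getChannelData(1), provided the buffer is already at 44.1 kHz.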

Browser usage (sketch)

import * as ort from "onnxruntime-web/wasm";
// Single-threaded WASM backend with SIMD enabled.
ort.env.wasm.numThreads = 1;
ort.env.wasm.simd = true;

const base = "https://huggingface.co/bgkb/onnx-encoder/resolve/main";
const buf  = new Uint8Array(await (await fetch(`${base}/encoder_q4.onnx`)).arrayBuffer());
const sess = await ort.InferenceSession.create(buf, { executionProviders: ["wasm"] });

// audio: Float32Array in planar (channel-major) order: all N left samples, then all N right samples; N % 8192 === 0
const latents = (await sess.run({ audio: new ort.Tensor("float32", audio, [1, 2, N]) })).latents;
// → feed latents to the diffusion sampler (variation) or straight to the decoder (round trip)
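
If the input was zero-padded, the trailing latent frames can be dropped before further use. A minimal sketch follows, assuming the output tensor exposes row-major data with dims [1, 256, T], as onnxruntime-web tensors do through .data and .dims. The helper name trimLatentFrames is hypothetical. The text says the latent length is even by construction; whether the decoder accepts an odd length is not stated, so keeping an even frame count is the safer choice.

```javascript
// Hypothetical helper: keep only the first `keep` time frames of a
// (1, channels, frames) latent stored in row-major order.
function trimLatentFrames(data, dims, keep) {
  const [batch, channels, frames] = dims;
  if (batch !== 1) throw new Error("expected batch size 1");
  if (keep < 0 || keep > frames) throw new Error("keep out of range");
  const out = new Float32Array(channels * keep);
  for (let c = 0; c < channels; c++) {
    // Each channel is a contiguous run of `frames` values.
    const row = data.subarray(c * frames, c * frames + keep);
    out.set(row, c * keep);
  }
  return { data: out, dims: [1, channels, keep] };
}
```

With the session above this would be called as trimLatentFrames(latents.data, latents.dims, keep), where keep is the number of frames that cover real audio.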

Quality

Measured over the full round trip (ONNX encoder → q4 browser decoder) on a real clip:

precision         size    reconstruction SNR
fp32              215 MB  10.2 dB
int4 (this file)  36 MB   8.2 dB

The int4 export is bit-faithful to the fp32 graph except for the quantized weights; its latents correlate at 0.978 with the fp32 latents. For variation (the audio analogue of img2img), the encoder latent is re-noised and refined by the diffusion sampler, so the int4 precision loss is not audible in practice.
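
For readers who want to run a similar check on their own clips, the two metrics can be sketched as below. The exact procedure behind the reported figures is not given here, so the definitions are assumptions: SNR as signal energy over error energy in dB, and correlation as the Pearson coefficient over flattened latents. Both function names are hypothetical.

```javascript
// Assumed definition: 10 * log10(sum(ref^2) / sum((ref - est)^2)).
function snrDb(reference, estimate) {
  if (reference.length !== estimate.length) throw new Error("length mismatch");
  let signal = 0;
  let noise = 0;
  for (let i = 0; i < reference.length; i++) {
    const d = reference[i] - estimate[i];
    signal += reference[i] * reference[i];
    noise += d * d;
  }
  return 10 * Math.log10(signal / noise);
}

// Assumed definition: Pearson correlation between two flattened arrays.
function pearson(a, b) {
  if (a.length !== b.length) throw new Error("length mismatch");
  const n = a.length;
  let meanA = 0;
  let meanB = 0;
  for (let i = 0; i < n; i++) {
    meanA += a[i];
    meanB += b[i];
  }
  meanA /= n;
  meanB /= n;
  let cov = 0;
  let varA = 0;
  let varB = 0;
  for (let i = 0; i < n; i++) {
    const da = a[i] - meanA;
    const db = b[i] - meanB;
    cov += da * db;
    varA += da * da;
    varB += db * db;
  }
  return cov / Math.sqrt(varA * varB);
}
```

snrDb would take the original and reconstructed waveforms, trimmed to the same length; pearson would take the latents.data arrays from the fp32 and int4 sessions run on the same input.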

License

Inherits the Stability AI Community License from the upstream weights; the bundled T5Gemma text encoder (used elsewhere in the pipeline) additionally falls under Google's Gemma Terms of Use. This repo contains only the audio autoencoder's encoder.
