---
license: other
license_name: stability-ai-community-license
license_link: https://huggingface.co/stabilityai/stable-audio-3-small-music/blob/main/LICENSE.md
tags:
- audio
- audio-to-audio
- stable-audio
- onnx
- q4
- matmulnbits
- onnxruntime-web
library_name: onnxruntime
base_model: stabilityai/stable-audio-3-small-music
---

# stable-audio-3-small-music: ONNX audio encoder (int4)

The **encoder** half of the [`stabilityai/stable-audio-3-small-music`](https://huggingface.co/stabilityai/stable-audio-3-small-music) autoencoder, exported to ONNX so audio can be turned into latents **in the browser** via [`onnxruntime-web`](https://www.npmjs.com/package/onnxruntime-web).

It is the missing counterpart to the decoder in [`lsb/stable-audio-3-small-music-onnx`](https://huggingface.co/lsb/stable-audio-3-small-music-onnx): the decoder maps latents → audio, this maps **audio → latents**. Together they enable audio-to-audio (variation), inpainting, and continuation on top of that text-to-music bundle.

The encode path is `patchify (patch_size=256) → SAME encoder (taae_v2) → softnorm bottleneck`. Linear/MatMul weights are quantized to **int4 MatMulNBits** (`block_size=32`, symmetric); convolutions and norms stay fp32. Single file, no external data.

## I/O

```
input   audio    float32  (1, 2, N)         stereo, 44.1 kHz, N a multiple of 8192
output  latents  float32  (1, 256, N/4096)  same latent space as the decoder's input
```

`N` **must be a multiple of 8192 samples** (the model's `audio_align`, so the latent length `t_lat = N/4096` is even). Pad shorter clips with zeros; the latent is laid out in time, so you can trim trailing latent frames that correspond to the padding.
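The padding and trimming rule above can be sketched as follows. This is a minimal illustration, not part of the exported model: `padStereo` and `trimLatents` are hypothetical helper names, and treating `Math.ceil(n / 4096)` as the number of frames worth keeping is an assumption (frames near the boundary still see some of the zero padding).

```js
// Constants taken from the I/O notes above.
const AUDIO_ALIGN = 8192;       // N must be a multiple of this
const SAMPLES_PER_FRAME = 4096; // one latent frame per 4096 input samples

// Hypothetical helper: zero-pad two channels to the next multiple of
// AUDIO_ALIGN and lay them out channel after channel: [L(0..N-1), R(0..N-1)].
function padStereo(left, right) {
  if (left.length !== right.length) {
    throw new Error("channel length mismatch");
  }
  const n = left.length;
  const N = Math.max(AUDIO_ALIGN, Math.ceil(n / AUDIO_ALIGN) * AUDIO_ALIGN);
  const audio = new Float32Array(2 * N); // zero-filled, so padding is silence
  audio.set(left, 0);
  audio.set(right, N);
  // Assumption: keep every latent frame that covers at least one real sample.
  const validFrames = Math.ceil(n / SAMPLES_PER_FRAME);
  return { audio, N, validFrames };
}

// Hypothetical helper: drop trailing frames from flat (1, channels, totalFrames)
// latent data, keeping the first validFrames frames of each channel.
function trimLatents(data, channels, totalFrames, validFrames) {
  const out = new Float32Array(channels * validFrames);
  for (let c = 0; c < channels; c++) {
    const start = c * totalFrames;
    out.set(data.subarray(start, start + validFrames), c * validFrames);
  }
  return out;
}
```

The returned `audio` and `N` are shaped for a `(1, 2, N)` input tensor, and the flat output data can be trimmed with `trimLatents(data, 256, N / 4096, validFrames)`.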
## Browser usage (sketch)

```js
import * as ort from "onnxruntime-web/wasm";

ort.env.wasm.numThreads = 1;
ort.env.wasm.simd = true;

// Fetch the single-file int4 model and create a WASM session.
const base = "https://huggingface.co/bgkb/onnx-encoder/resolve/main";
const buf = new Uint8Array(await (await fetch(`${base}/encoder_q4.onnx`)).arrayBuffer());
const sess = await ort.InferenceSession.create(buf, { executionProviders: ["wasm"] });

// audio: Float32Array in planar (channel-major) order [L(0..N-1), R(0..N-1)], N % 8192 === 0
const latents = (await sess.run({ audio: new ort.Tensor("float32", audio, [1, 2, N]) })).latents;
// → feed latents to the diffusion sampler (variation) or straight to the decoder (round-trip)
```

## Quality

Measured as the full round-trip **ONNX encoder → the q4 browser decoder** on a real clip:

| precision | size | reconstruction SNR |
|-----------|------|--------------------|
| fp32 | 215 MB | 10.2 dB |
| **int4 (this file)** | **36 MB** | **8.2 dB** |

The int4 export uses the same graph as the fp32 export; only the quantized weights differ (latent correlation 0.978 vs fp32). For variation/img2img the encoder latent is re-noised and refined by the diffusion sampler, so the int4 precision loss is not audible in practice.

## License

Inherits the **Stability AI Community License** from the upstream weights. The T5Gemma text encoder used elsewhere in the pipeline additionally falls under Google's Gemma Terms of Use, but it is not part of this repo, which contains only the audio autoencoder's encoder.