| --- |
| license: other |
| license_name: stability-ai-community-license |
| license_link: https://huggingface.co/stabilityai/stable-audio-3-small-music/blob/main/LICENSE.md |
| tags: |
| - audio |
| - audio-to-audio |
| - stable-audio |
| - onnx |
| - q4 |
| - matmulnbits |
| - onnxruntime-web |
| library_name: onnxruntime |
| base_model: stabilityai/stable-audio-3-small-music |
| --- |
| |
| # stable-audio-3-small-music: ONNX audio encoder (int4) |
|
|
| The **encoder** half of the [`stabilityai/stable-audio-3-small-music`](https://huggingface.co/stabilityai/stable-audio-3-small-music) |
| autoencoder, exported to ONNX so audio can be turned into latents **in the browser** |
| via [`onnxruntime-web`](https://www.npmjs.com/package/onnxruntime-web). |
|
|
| It is the missing counterpart to the decoder in |
| [`lsb/stable-audio-3-small-music-onnx`](https://huggingface.co/lsb/stable-audio-3-small-music-onnx): |
| the decoder maps latents → audio, this maps **audio → latents**. Together they enable |
| audio-to-audio (variation), inpainting, and continuation on top of that text-to-music bundle. |
|
|
| The encode path is `patchify (patch_size=256) → SAME encoder (taae_v2) → softnorm bottleneck`. |
| Linear/MatMul weights are quantized to **int4 MatMulNBits** (`block_size=32`, symmetric); |
| convolutions and norms stay fp32. Single file, no external data. |
|
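To illustrate what "int4, `block_size=32`, symmetric" means for a weight row, here is a minimal sketch of block-wise symmetric quantization. It is an illustration only: the function names are hypothetical, the values are left unpacked (one `Int8Array` entry per 4-bit value), and the exact scale rule, rounding, and bit packing used by ONNX Runtime's MatMulNBits quantizer may differ.

```javascript
// Illustrative symmetric block-wise 4-bit quantization (not the ONNX Runtime implementation).
// Each block of `blockSize` weights shares one scale; there is no per-block zero-point offset.
function quantizeInt4(weights, blockSize = 32) {
  const nBlocks = Math.ceil(weights.length / blockSize);
  const q = new Int8Array(weights.length);  // one 4-bit value per entry, stored unpacked
  const scales = new Float32Array(nBlocks); // one scale per block
  for (let b = 0; b < nBlocks; b++) {
    const start = b * blockSize;
    const end = Math.min(start + blockSize, weights.length);
    let maxAbs = 0;
    for (let i = start; i < end; i++) maxAbs = Math.max(maxAbs, Math.abs(weights[i]));
    // Assumption: map the largest magnitude in the block to 7.
    const scale = maxAbs > 0 ? maxAbs / 7 : 1;
    scales[b] = scale;
    for (let i = start; i < end; i++) {
      q[i] = Math.max(-8, Math.min(7, Math.round(weights[i] / scale)));
    }
  }
  return { q, scales };
}

// Reconstruct approximate fp32 weights from the quantized values and per-block scales.
function dequantizeInt4({ q, scales }, blockSize = 32) {
  const out = new Float32Array(q.length);
  for (let i = 0; i < q.length; i++) {
    out[i] = q[i] * scales[Math.floor(i / blockSize)];
  }
  return out;
}
```

Under this scheme each weight costs 4 bits plus a share of one scale per 32 weights, which is roughly why the quantized file is several times smaller than the fp32 export while the unquantized convolutions and norms keep it from shrinking a full 8x.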
|
| ## I/O |
|
|
| ``` |
| input audio float32 (1, 2, N) stereo, 44.1 kHz, N a multiple of 8192 |
| output latents float32 (1, 256, N/4096) same latent space as the decoder's input |
| ``` |
|
|
| `N` **must be a multiple of 8192 samples** (the model's `audio_align`), which keeps the latent |
| length `t_lat = N/4096` even. Zero-pad clips whose length is not a multiple of 8192. The latent |
| is laid out in time, so you can trim the trailing latent frames that correspond to the padding. |
|
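The padding rule can be sketched as a small helper that packs two channel arrays into the `[L..., R...]` layout the model expects and rounds `N` up to the next multiple of 8192. The helper name, the constants, and the `validFrames` convention (keep every latent frame that overlaps at least one real sample) are assumptions for illustration, not part of the model.

```javascript
const AUDIO_ALIGN = 8192;        // N must be a multiple of this (see the I/O section)
const SAMPLES_PER_LATENT = 4096; // one latent frame per 4096 input samples

// Pack two equal-length channel arrays into planar [L..., R...] order,
// zero-padded at the end so that N is a multiple of AUDIO_ALIGN.
function padStereo(left, right) {
  if (left.length !== right.length) throw new Error("channel length mismatch");
  const origN = left.length;
  const N = Math.max(AUDIO_ALIGN, Math.ceil(origN / AUDIO_ALIGN) * AUDIO_ALIGN);
  const audio = new Float32Array(2 * N); // zero-initialised, so the padding is already there
  audio.set(left, 0);  // channel 0 occupies [0, N)
  audio.set(right, N); // channel 1 occupies [N, 2N)
  // Assumed convention: a latent frame is "valid" if it overlaps any real sample.
  const validFrames = Math.ceil(origN / SAMPLES_PER_LATENT);
  return { audio, N, validFrames };
}
```

The returned `audio` and `N` can be passed straight to `new ort.Tensor("float32", audio, [1, 2, N])`.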
|
| ## Browser usage (sketch) |
|
|
| ```js |
| import * as ort from "onnxruntime-web/wasm"; |
| ort.env.wasm.numThreads = 1; ort.env.wasm.simd = true; |
| |
| const base = "https://huggingface.co/bgkb/onnx-encoder/resolve/main"; |
| const buf = new Uint8Array(await (await fetch(`${base}/encoder_q4.onnx`)).arrayBuffer()); |
| const sess = await ort.InferenceSession.create(buf, { executionProviders: ["wasm"] }); |
| |
| // audio: Float32Array in planar (channel-major) order [L(0..N-1), R(0..N-1)], N % 8192 === 0 |
| const latents = (await sess.run({ audio: new ort.Tensor("float32", audio, [1, 2, N]) })).latents; |
| // → feed latents to the diffusion model (variation) or straight to the decoder (round-trip) |
| ``` |
|
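If the input was zero-padded, the trailing latent frames can be dropped after inference. The sketch below assumes the output tensor's `data` is a row-major `Float32Array` of shape `[1, C, T]`, so each of the `C` channels is a contiguous run of `T` frames. The function name is hypothetical, and the decoder or sampler may impose its own length constraints on the trimmed result.

```javascript
// Keep only the first `keepFrames` time steps of a [1, C, T] latent stored row-major.
// Channel c occupies indices [c*T, (c+1)*T) in `data`.
function trimLatentFrames(data, channels, frames, keepFrames) {
  if (data.length !== channels * frames) throw new Error("shape mismatch");
  const keep = Math.min(Math.max(keepFrames, 0), frames);
  const out = new Float32Array(channels * keep);
  for (let c = 0; c < channels; c++) {
    out.set(data.subarray(c * frames, c * frames + keep), c * keep);
  }
  return out;
}
```

For example, `trimLatentFrames(latents.data, 256, latents.dims[2], keep)` returns the data for a `[1, 256, keep]` tensor, where `keep` is however many frames cover the unpadded audio.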
|
| ## Quality |
|
|
| Measured as the full round-trip **ONNX encoder → the q4 browser decoder** on a real clip: |
|
|
| | precision | size | reconstruction SNR | |
| |-----------|------|--------------------| |
| | fp32 | 215 MB | 10.2 dB | |
| | **int4 (this file)** | **36 MB** | **8.2 dB** | |
|
|
| The int4 export matches the fp32 graph except for the quantized weights |
| (latent correlation 0.978 vs fp32). For audio-to-audio variation (the analogue of img2img) the |
| encoder latent is re-noised and refined by the diffusion sampler, so the int4 precision loss is not audible in practice. |
|
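For readers who want to run a similar comparison, the two metrics can be computed as follows. This is a sketch using the standard definitions (signal-to-noise ratio in dB and the Pearson correlation coefficient). The exact procedure behind the numbers above is not specified here, so alignment, channel handling, and clip choice are left to the caller.

```javascript
// SNR in dB: 10 * log10( sum(ref^2) / sum((ref - est)^2) ).
function snrDb(reference, estimate) {
  if (reference.length !== estimate.length) throw new Error("length mismatch");
  let signal = 0;
  let noise = 0;
  for (let i = 0; i < reference.length; i++) {
    const d = reference[i] - estimate[i];
    signal += reference[i] * reference[i];
    noise += d * d;
  }
  return 10 * Math.log10(signal / noise);
}

// Pearson correlation between two equal-length arrays (e.g. flattened fp32 vs int4 latents).
function pearson(a, b) {
  if (a.length !== b.length) throw new Error("length mismatch");
  const n = a.length;
  let meanA = 0;
  let meanB = 0;
  for (let i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
  meanA /= n;
  meanB /= n;
  let cov = 0;
  let varA = 0;
  let varB = 0;
  for (let i = 0; i < n; i++) {
    const da = a[i] - meanA;
    const db = b[i] - meanB;
    cov += da * db;
    varA += da * da;
    varB += db * db;
  }
  return cov / Math.sqrt(varA * varB);
}
```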
|
| ## License |
|
|
| Inherits the **Stability AI Community License** from the upstream weights. The T5Gemma text |
| encoder used elsewhere in the pipeline additionally falls under Google's Gemma Terms of Use, |
| but it is not included here: this repo contains only the audio autoencoder's encoder. |
|
|