---
license: other
license_name: stability-ai-community-license
license_link: https://huggingface.co/stabilityai/stable-audio-3-small-music/blob/main/LICENSE.md
tags:
  - audio
  - audio-to-audio
  - stable-audio
  - onnx
  - q4
  - matmulnbits
  - onnxruntime-web
library_name: onnxruntime
base_model: stabilityai/stable-audio-3-small-music
---

# stable-audio-3-small-music — ONNX audio encoder (int4)

The **encoder** half of the [`stabilityai/stable-audio-3-small-music`](https://huggingface.co/stabilityai/stable-audio-3-small-music)
autoencoder, exported to ONNX so audio can be turned into latents **in the browser**
via [`onnxruntime-web`](https://www.npmjs.com/package/onnxruntime-web).

It is the missing counterpart to the decoder in
[`lsb/stable-audio-3-small-music-onnx`](https://huggingface.co/lsb/stable-audio-3-small-music-onnx):
the decoder maps latents → audio, this maps **audio → latents**. Together they enable
audio-to-audio (variation), inpainting, and continuation on top of that text-to-music bundle.

The encode path is `patchify (patch_size=256) → SAME encoder (taae_v2) → softnorm bottleneck`.
Linear/MatMul weights are quantized to **int4 MatMulNBits** (`block_size=32`, symmetric);
convolutions and norms stay fp32. Single file, no external data.
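
To make "symmetric int4 with `block_size=32`" concrete, here is a conceptual sketch of block-wise symmetric 4-bit quantization. It is an illustration only: the function names are hypothetical, the scale rule (`maxAbs / 7`) is an assumption, and the real MatMulNBits operator packs two 4-bit values per byte and may round differently.

```js
const BLOCK_SIZE = 32; // matches block_size=32 in the text above

// Quantize one block to signed 4-bit integers in [-7, 7] with a shared scale.
// Assumption: scale = maxAbs / 7; the exporter's exact rule may differ.
function quantizeBlock(block) {
  let maxAbs = 0;
  for (const v of block) maxAbs = Math.max(maxAbs, Math.abs(v));
  const scale = maxAbs > 0 ? maxAbs / 7 : 1;
  const q = new Int8Array(block.length);
  for (let i = 0; i < block.length; i++) {
    q[i] = Math.max(-7, Math.min(7, Math.round(block[i] / scale)));
  }
  return { q, scale };
}

// Reconstruct approximate float weights from the integers and the scale.
function dequantizeBlock({ q, scale }) {
  return Float32Array.from(q, (v) => v * scale);
}

// Round-trip a whole weight row block by block to inspect the quantization error.
function roundTrip(weights) {
  const out = new Float32Array(weights.length);
  for (let start = 0; start < weights.length; start += BLOCK_SIZE) {
    const block = weights.subarray(start, start + BLOCK_SIZE);
    out.set(dequantizeBlock(quantizeBlock(block)), start);
  }
  return out;
}
```

Each weight costs 4 bits plus its share of one scale per 32-weight block, and under this scheme the per-weight error is bounded by half the block's scale.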

## I/O

```
input   audio    float32  (1, 2, N)        stereo, 44.1 kHz, N a multiple of 8192
output  latents  float32  (1, 256, N/4096) same latent space as the decoder's input
```

`N` **must be a multiple of 8192 samples** (the model's `audio_align`, so the latent length
`t_lat = N/4096` is even). Pad shorter clips with zeros; the latent is laid out in time, so
you can trim trailing latent frames that correspond to the padding.
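
The padding and trimming described above can be sketched as follows. The helper names (`padStereo`, `trimLatents`) are hypothetical, and the sketch assumes the output tensor is stored row-major as `(1, 256, T)`. Whether downstream consumers accept an odd number of latent frames is not stated here, so check before trimming.

```js
const AUDIO_ALIGN = 8192;       // required input multiple, from the I/O spec
const SAMPLES_PER_FRAME = 4096; // one latent frame per 4096 samples (N/4096)
const LATENT_CHANNELS = 256;

// Pack left/right channels into one planar buffer [L..., R...], zero-padded
// so that the per-channel length n is a multiple of AUDIO_ALIGN.
function padStereo(left, right) {
  if (left.length !== right.length) throw new Error("channel length mismatch");
  const n = Math.max(AUDIO_ALIGN, Math.ceil(left.length / AUDIO_ALIGN) * AUDIO_ALIGN);
  const audio = new Float32Array(2 * n); // zero-initialised, so the tail is silence
  audio.set(left, 0);
  audio.set(right, n);
  // Number of latent frames that overlap real (non-padding) audio.
  const validFrames = Math.ceil(left.length / SAMPLES_PER_FRAME);
  return { audio, n, validFrames };
}

// Keep only the first `validFrames` frames of a (1, 256, totalFrames) latent tensor.
function trimLatents(data, totalFrames, validFrames) {
  const out = new Float32Array(LATENT_CHANNELS * validFrames);
  for (let c = 0; c < LATENT_CHANNELS; c++) {
    const row = data.subarray(c * totalFrames, c * totalFrames + validFrames);
    out.set(row, c * validFrames);
  }
  return out;
}
```

For a 10,000-sample clip this pads to `n = 16384` (4 latent frames), of which the first 3 overlap real audio.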

## Browser usage (sketch)

```js
import * as ort from "onnxruntime-web/wasm";
ort.env.wasm.numThreads = 1; ort.env.wasm.simd = true;

const base = "https://huggingface.co/bgkb/onnx-encoder/resolve/main";
const buf  = new Uint8Array(await (await fetch(`${base}/encoder_q4.onnx`)).arrayBuffer());
const sess = await ort.InferenceSession.create(buf, { executionProviders: ["wasm"] });

// audio: Float32Array in planar (channel-major) order [L(0..N-1), R(0..N-1)], N % 8192 === 0
const latents = (await sess.run({ audio: new ort.Tensor("float32", audio, [1, 2, N]) })).latents;
// → feed latents to the diffusion (variation) or straight to the decoder (round-trip)
```
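
One way to obtain the `audio` array is from a decoded Web Audio `AudioBuffer`. The sketch below is a hypothetical helper (`planarStereo`), not part of this repo. It assumes the buffer is already at 44.1 kHz (for example, decoded with an `AudioContext` created with `sampleRate: 44100`) and duplicates a mono channel to both sides, which is an assumption about how mono input should be handled. The result still needs padding to a multiple of 8192 samples per channel before inference.

```js
// `audioBuffer` is any object with the Web Audio AudioBuffer shape:
// numberOfChannels, length, sampleRate, getChannelData(channel).
function planarStereo(audioBuffer) {
  if (audioBuffer.sampleRate !== 44100) {
    throw new Error(`expected 44100 Hz, got ${audioBuffer.sampleRate}`);
  }
  const n = audioBuffer.length;
  const left = audioBuffer.getChannelData(0);
  // Mono input: reuse the single channel for both sides (an assumption).
  const right = audioBuffer.numberOfChannels > 1 ? audioBuffer.getChannelData(1) : left;
  // Planar layout: all left samples first, then all right samples.
  const audio = new Float32Array(2 * n);
  audio.set(left, 0);
  audio.set(right, n);
  return audio;
}
```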

## Quality

Measured as the full round-trip **ONNX encoder → the q4 browser decoder** on a real clip:

| precision | size | reconstruction SNR |
|-----------|------|--------------------|
| fp32      | 215 MB | 10.2 dB |
| **int4 (this file)** | **36 MB** | **8.2 dB** |

The int4 export is bit-faithful to the fp32 graph except for the quantized weights
(latent correlation 0.978 against the fp32 latents). For variation (audio-to-audio), the
encoder latent is re-noised and refined by the diffusion sampler, so the int4 precision
loss is not audible in practice.
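
For reference, metrics of this kind are commonly computed as below. This is a sketch under assumptions: the exact definitions behind the numbers above are not stated, so the SNR here is the usual signal-to-error power ratio in dB and the correlation is the Pearson coefficient over flattened arrays. The function names are hypothetical.

```js
// Reconstruction SNR in dB: 10 * log10(sum(x^2) / sum((x - y)^2)).
function snrDb(reference, estimate) {
  if (reference.length !== estimate.length) throw new Error("length mismatch");
  let signal = 0;
  let noise = 0;
  for (let i = 0; i < reference.length; i++) {
    const d = reference[i] - estimate[i];
    signal += reference[i] * reference[i];
    noise += d * d;
  }
  return 10 * Math.log10(signal / noise);
}

// Pearson correlation between two equally sized arrays (e.g. flattened latents).
function pearson(a, b) {
  if (a.length !== b.length) throw new Error("length mismatch");
  const n = a.length;
  let meanA = 0;
  let meanB = 0;
  for (let i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
  meanA /= n;
  meanB /= n;
  let cov = 0;
  let varA = 0;
  let varB = 0;
  for (let i = 0; i < n; i++) {
    const da = a[i] - meanA;
    const db = b[i] - meanB;
    cov += da * db;
    varA += da * da;
    varB += db * db;
  }
  return cov / Math.sqrt(varA * varB);
}
```

With these definitions, a reconstruction whose samples are all 10% too quiet scores 20 dB, so the single-digit figures in the table indicate a lossy but correlated round-trip.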

## License

Inherits the **Stability AI Community License** from the upstream weights; the bundled
T5Gemma text encoder (used elsewhere in the pipeline) additionally falls under Google's
Gemma Terms of Use. This repo contains only the audio autoencoder's encoder.