Document encoder I/O, alignment, quality
README.md
ADDED
@@ -0,0 +1,75 @@
---
license: other
license_name: stability-ai-community-license
license_link: https://huggingface.co/stabilityai/stable-audio-3-small-music/blob/main/LICENSE.md
tags:
- audio
- audio-to-audio
- stable-audio
- onnx
- q4
- matmulnbits
- onnxruntime-web
library_name: onnxruntime
base_model: stabilityai/stable-audio-3-small-music
---

# stable-audio-3-small-music — ONNX audio encoder (int4)

This is the **encoder** half of the [`stabilityai/stable-audio-3-small-music`](https://huggingface.co/stabilityai/stable-audio-3-small-music)
autoencoder, exported to ONNX so audio can be turned into latents **in the browser**
via [`onnxruntime-web`](https://www.npmjs.com/package/onnxruntime-web).

It is the missing counterpart to the decoder in
[`lsb/stable-audio-3-small-music-onnx`](https://huggingface.co/lsb/stable-audio-3-small-music-onnx):
the decoder maps latents → audio, and this model maps **audio → latents**. Together they enable
audio-to-audio (variation), inpainting, and continuation on top of that text-to-music bundle.

The encode path is `patchify (patch_size=256) → SAME encoder (taae_v2) → softnorm bottleneck`.
Linear/MatMul weights are quantized to **int4 MatMulNBits** (`block_size=32`, symmetric);
convolutions and norms stay fp32. The model ships as a single file with no external data.
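
To give a feel for what block-wise symmetric int4 quantization does to a weight row, here is a minimal sketch. It is not the exporter's code and does not reproduce the packed MatMulNBits storage layout. The helper names `quantizeBlockSymmetric` and `dequantizeBlocks` are hypothetical, and the integer range [-7, 7] with `scale = max|w| / 7` is an assumed convention that a real exporter may not share.

```js
// Illustrative only: symmetric 4-bit quantization with one scale per block.
// Assumption: integers in [-7, 7] and scale = max|w| / 7 within each block.
function quantizeBlockSymmetric(weights, blockSize = 32) {
  const numBlocks = Math.ceil(weights.length / blockSize);
  const q = new Int8Array(weights.length);    // one 4-bit value per entry, left unpacked
  const scales = new Float32Array(numBlocks); // one scale per block
  for (let b = 0; b < numBlocks; b++) {
    const start = b * blockSize;
    const end = Math.min(start + blockSize, weights.length);
    let maxAbs = 0;
    for (let i = start; i < end; i++) maxAbs = Math.max(maxAbs, Math.abs(weights[i]));
    scales[b] = maxAbs > 0 ? maxAbs / 7 : 1;
    const scale = scales[b]; // read back the float32-rounded scale
    for (let i = start; i < end; i++) {
      q[i] = Math.max(-7, Math.min(7, Math.round(weights[i] / scale)));
    }
  }
  return { q, scales, blockSize };
}

// Reconstruct approximate float weights from the quantized values and scales.
function dequantizeBlocks({ q, scales, blockSize }) {
  const out = new Float32Array(q.length);
  for (let i = 0; i < q.length; i++) {
    out[i] = q[i] * scales[Math.floor(i / blockSize)];
  }
  return out;
}
```

With this scheme each reconstructed weight differs from the original by at most about half a scale step of its block, which is why a small block size such as 32 limits the damage from a single large outlier weight.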

## I/O

```
input   audio    float32  (1, 2, N)         stereo, 44.1 kHz, N a multiple of 8192
output  latents  float32  (1, 256, N/4096)  same latent space as the decoder's input
```

`N` **must be a multiple of 8192 samples** (the model's `audio_align`), which keeps the latent length
`t_lat = N/4096` even. Pad clips whose length is not a multiple of 8192 with trailing zeros; the latent
is laid out in time, so you can trim the trailing latent frames that correspond to the padding.
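
The padding and trimming rule can be sketched as follows. The constants come from the I/O description above; the helpers `padToAlign` and `trimLatents` are hypothetical names, and the sketch assumes that latent frame `i` corresponds to samples `4096*i` through `4096*(i+1) - 1`, which the text implies but does not state exactly.

```js
const AUDIO_ALIGN = 8192;        // N must be a multiple of this
const SAMPLES_PER_LATENT = 4096; // t_lat = N / 4096

// Hypothetical helper: zero-pad two channel arrays into one planar buffer
// [L(0..N-1), R(0..N-1)] whose per-channel length N is a multiple of 8192.
function padToAlign(left, right) {
  if (left.length !== right.length) throw new Error("channel lengths differ");
  const n = left.length;
  const paddedN = Math.max(AUDIO_ALIGN, Math.ceil(n / AUDIO_ALIGN) * AUDIO_ALIGN);
  const audio = new Float32Array(2 * paddedN); // zero-filled by default
  audio.set(left, 0);
  audio.set(right, paddedN);
  // Latent frames that overlap at least one real (unpadded) sample.
  const validFrames = Math.ceil(n / SAMPLES_PER_LATENT);
  return { audio, paddedN, validFrames };
}

// Hypothetical helper: keep the first `keep` frames of a flat (1, C, T) latent.
function trimLatents(data, channels, frames, keep) {
  const out = new Float32Array(channels * keep);
  for (let c = 0; c < channels; c++) {
    out.set(data.subarray(c * frames, c * frames + keep), c * keep);
  }
  return out;
}
```

For a 10,000-sample clip this pads to `N = 16384`, giving 4 latent frames of which the first 3 overlap real audio. If the trimmed latents are passed on to the sampler or decoder, the frame count may need to stay even; that is an inference from the alignment rule, not something the README confirms.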

## Browser usage (sketch)

```js
import * as ort from "onnxruntime-web/wasm";

ort.env.wasm.numThreads = 1;
ort.env.wasm.simd = true;

const base = "https://huggingface.co/bgkb/onnx-encoder/resolve/main";
const buf = new Uint8Array(await (await fetch(`${base}/encoder_q4.onnx`)).arrayBuffer());
const sess = await ort.InferenceSession.create(buf, { executionProviders: ["wasm"] });

// audio: Float32Array in planar (channel-major) order, i.e. all N left samples
// followed by all N right samples: [L(0..N-1), R(0..N-1)], with N % 8192 === 0.
// `audio` and `N` are assumed to be prepared by the caller.
const { latents } = await sess.run({ audio: new ort.Tensor("float32", audio, [1, 2, N]) });
// → feed latents to the diffusion (variation) or straight to the decoder (round-trip)
```
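
The input tensor expects planar data, while many audio sources deliver interleaved samples (`L0, R0, L1, R1, ...`). A minimal conversion sketch, assuming an interleaved `Float32Array` as input; `interleavedToPlanar` is a hypothetical helper name, not part of `onnxruntime-web`:

```js
// Hypothetical helper: convert interleaved samples [L0, R0, L1, R1, ...]
// to planar order [L0, L1, ..., R0, R1, ...] as expected for shape (1, 2, N).
function interleavedToPlanar(interleaved, numChannels = 2) {
  if (interleaved.length % numChannels !== 0) {
    throw new Error("sample count is not a multiple of the channel count");
  }
  const n = interleaved.length / numChannels; // samples per channel
  const planar = new Float32Array(interleaved.length);
  for (let i = 0; i < n; i++) {
    for (let c = 0; c < numChannels; c++) {
      planar[c * n + i] = interleaved[i * numChannels + c];
    }
  }
  return planar;
}
```

If the audio comes from a Web Audio `AudioBuffer`, `getChannelData(0)` and `getChannelData(1)` already return one array per channel, so copying them back to back into a single `Float32Array` is enough.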

## Quality

Measured as the full round trip **ONNX encoder → the q4 browser decoder** on a real clip:

| precision | size | reconstruction SNR |
|-----------|------|--------------------|
| fp32 | 215 MB | 10.2 dB |
| **int4 (this file)** | **36 MB** | **8.2 dB** |

The int4 export is identical to the fp32 graph except for the quantized weights
(latent correlation with fp32: 0.978). For variation (img2img-style) use, the encoder latent is re-noised
and refined by the diffusion sampler, so the int4 precision loss is not audible in practice.
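
The README does not say exactly how the two metrics were computed. The sketch below uses common definitions, which may differ from the script behind the numbers above: SNR as `10 * log10(sum(x^2) / sum((x - y)^2))` and correlation as the Pearson coefficient over flattened arrays. `snrDb` and `pearson` are hypothetical helper names.

```js
// Signal-to-noise ratio in dB between a reference signal and its reconstruction.
function snrDb(reference, estimate) {
  if (reference.length !== estimate.length) throw new Error("length mismatch");
  let signal = 0;
  let noise = 0;
  for (let i = 0; i < reference.length; i++) {
    const diff = reference[i] - estimate[i];
    signal += reference[i] * reference[i];
    noise += diff * diff;
  }
  return 10 * Math.log10(signal / noise);
}

// Pearson correlation between two equally long arrays (e.g. flattened latents).
function pearson(a, b) {
  if (a.length !== b.length) throw new Error("length mismatch");
  let meanA = 0;
  let meanB = 0;
  for (let i = 0; i < a.length; i++) { meanA += a[i]; meanB += b[i]; }
  meanA /= a.length;
  meanB /= b.length;
  let cov = 0;
  let varA = 0;
  let varB = 0;
  for (let i = 0; i < a.length; i++) {
    const da = a[i] - meanA;
    const db = b[i] - meanB;
    cov += da * db;
    varA += da * da;
    varB += db * db;
  }
  return cov / Math.sqrt(varA * varB);
}
```

Under this definition, 10 dB means the error power is one tenth of the signal power, so the 2 dB gap between the two rows corresponds to roughly 1.6 times more error power for the int4 encoder.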

## License

Inherits the **Stability AI Community License** from the upstream weights. The T5Gemma
text encoder bundled elsewhere in the pipeline additionally falls under Google's
Gemma Terms of Use; this repo contains only the audio autoencoder's encoder.