
	---
	license: other
	license_name: stability-ai-community-license
	license_link: https://huggingface.co/stabilityai/stable-audio-3-small-music/blob/main/LICENSE.md
	tags:
	- audio
	- audio-to-audio
	- stable-audio
	- onnx
	- q4
	- matmulnbits
	- onnxruntime-web
	library_name: onnxruntime
	base_model: stabilityai/stable-audio-3-small-music
	---

	# stable-audio-3-small-music — ONNX audio encoder (int4)

	The encoder half of the [`stabilityai/stable-audio-3-small-music`](https://huggingface.co/stabilityai/stable-audio-3-small-music)
	autoencoder, exported to ONNX so audio can be turned into latents in the browser
	via [`onnxruntime-web`](https://www.npmjs.com/package/onnxruntime-web).

	It is the missing counterpart to the decoder in
	[`lsb/stable-audio-3-small-music-onnx`](https://huggingface.co/lsb/stable-audio-3-small-music-onnx):
	the decoder maps latents → audio, this maps audio → latents. Together they enable
	audio-to-audio (variation), inpainting, and continuation on top of that text-to-music bundle.

	The encode path is `patchify (patch_size=256) → SAME encoder (taae_v2) → softnorm bottleneck`.
	Linear/MatMul weights are quantized to int4 MatMulNBits (`block_size=32`, symmetric);
	convolutions and norms stay fp32. Single file, no external data.
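
	To illustrate what "int4, `block_size=32`, symmetric" means for a weight row, here is a minimal sketch of generic symmetric blockwise 4-bit quantization. It is an assumption-laden illustration, not the exact scheme or bit packing that onnxruntime's MatMulNBits uses: the helper names are hypothetical, values are kept unpacked in an `Int8Array`, and mapping the largest magnitude to ±7 is one common convention.

	```javascript
	// Generic symmetric blockwise 4-bit quantization (illustrative only;
	// NOT onnxruntime's exact MatMulNBits layout or rounding rule).
	const BLOCK_SIZE = 32; // block size stated in this model card

	// Hypothetical helper: one float scale per block of `blockSize` weights.
	function quantizeBlockwise(weights, blockSize = BLOCK_SIZE) {
	  const nBlocks = Math.ceil(weights.length / blockSize);
	  const scales = new Float32Array(nBlocks);
	  const q = new Int8Array(weights.length); // one 4-bit value per entry, unpacked
	  for (let b = 0; b < nBlocks; b++) {
	    const start = b * blockSize;
	    const end = Math.min(start + blockSize, weights.length);
	    let maxAbs = 0;
	    for (let i = start; i < end; i++) maxAbs = Math.max(maxAbs, Math.abs(weights[i]));
	    // Assumption: map the largest magnitude in the block to 7; no zero point.
	    const scale = maxAbs > 0 ? maxAbs / 7 : 1;
	    scales[b] = scale;
	    for (let i = start; i < end; i++) {
	      q[i] = Math.max(-8, Math.min(7, Math.round(weights[i] / scale)));
	    }
	  }
	  return { q, scales };
	}

	// Hypothetical helper: reconstruct approximate float weights.
	function dequantizeBlockwise({ q, scales }, blockSize = BLOCK_SIZE) {
	  const out = new Float32Array(q.length);
	  for (let i = 0; i < q.length; i++) {
	    out[i] = q[i] * scales[Math.floor(i / blockSize)];
	  }
	  return out;
	}
	```

	Each weight is recovered to within about half a quantization step of its block, which is why smaller blocks track the fp32 weights more closely at the cost of storing more scales.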

	## I/O

	```
	input   audio    float32  (1, 2, N)         stereo, 44.1 kHz, N a multiple of 8192
	output  latents  float32  (1, 256, N/4096)  same latent space as the decoder's input
	```

	`N` must be a multiple of 8192 samples (the model's `audio_align`), which keeps the latent
	length `t_lat = N/4096` even. Zero-pad any clip whose length is not a multiple of 8192; the
	latent is laid out in time, so you can trim the trailing latent frames that correspond to the padding.
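
	The padding rule can be sketched as follows. This is a minimal example, not part of the repo: `padStereoToAlign` is a hypothetical helper, it assumes the two channels arrive as separate `Float32Array`s at 44.1 kHz, and `validFrames` is only an estimate of how many latent frames overlap real audio (frames near the boundary may still be influenced by the zero padding).

	```javascript
	const AUDIO_ALIGN = 8192;       // required multiple for N (see I/O above)
	const SAMPLES_PER_FRAME = 4096; // one latent frame per 4096 samples (t_lat = N/4096)

	// Hypothetical helper: zero-pad both channels to a multiple of AUDIO_ALIGN and
	// pack them channel after channel, matching an input shape of (1, 2, N).
	function padStereoToAlign(left, right) {
	  if (left.length !== right.length) throw new Error("channel lengths differ");
	  const n = left.length;
	  const N = Math.max(1, Math.ceil(n / AUDIO_ALIGN)) * AUDIO_ALIGN;
	  const audio = new Float32Array(2 * N); // zero-filled, so padding is silence
	  audio.set(left, 0);                    // L occupies [0, N)
	  audio.set(right, N);                   // R occupies [N, 2N)
	  // Latent frames that overlap real (unpadded) samples.
	  const validFrames = Math.ceil(n / SAMPLES_PER_FRAME);
	  return { audio, N, validFrames };
	}
	```

	For a 10,000-sample clip this gives `N = 16384`, a latent length of 4 frames, and 3 frames that overlap real audio.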

	## Browser usage (sketch)

	```js
	import * as ort from "onnxruntime-web/wasm";
	ort.env.wasm.numThreads = 1; ort.env.wasm.simd = true;

	const base = "https://huggingface.co/bgkb/encoder-onnx/resolve/main";
	const buf = new Uint8Array(await (await fetch(`${base}/encoder_q4.onnx`)).arrayBuffer());
	const sess = await ort.InferenceSession.create(buf, { executionProviders: ["wasm"] });

	// audio: Float32Array in planar (channel-major) layout, not sample-interleaved:
	// [L(0..N-1), R(0..N-1)], with N % 8192 === 0
	const latents = (await sess.run({ audio: new ort.Tensor("float32", audio, [1, 2, N]) })).latents;
	// → feed latents to the diffusion sampler (variation) or straight to the decoder (round-trip)
	```
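
	If the clip was zero-padded, the trailing latent frames can be dropped before further use. The sketch below assumes the output tensor's data is a flat, row-major `Float32Array` for shape `[1, channels, frames]`, so trimming in time means copying a prefix of each channel row; `trimLatentFrames` is a hypothetical helper, not something this repo ships.

	```javascript
	// Hypothetical helper: keep the first `keep` time frames of a flat,
	// row-major [1, channels, frames] latent buffer.
	function trimLatentFrames(data, channels, frames, keep) {
	  if (data.length !== channels * frames) throw new Error("shape mismatch");
	  if (keep < 0 || keep > frames) throw new RangeError("keep out of range");
	  const out = new Float32Array(channels * keep);
	  for (let c = 0; c < channels; c++) {
	    // Each channel row is contiguous; copy only its first `keep` frames.
	    out.set(data.subarray(c * frames, c * frames + keep), c * keep);
	  }
	  return out;
	}

	// Usage with the session output above (keepFrames is whatever count you decide to keep):
	//   const trimmed = trimLatentFrames(latents.data, 256, N / 4096, keepFrames);
	//   const t = new ort.Tensor("float32", trimmed, [1, 256, keepFrames]);
	```

	Whether the downstream decoder or sampler accepts an odd number of latent frames is not stated here, so keeping an even count is the cautious choice.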

	## Quality

	Measured over the full round trip (this ONNX encoder followed by the q4 browser decoder) on a real clip:

	| precision | size | reconstruction SNR |
	|-----------|------|--------------------|
	| fp32 | 215 MB | 10.2 dB |
	| int4 (this file) | 36 MB | 8.2 dB |

	The int4 export matches the fp32 graph except for the quantized weights
	(latent correlation 0.978 vs fp32). For variation (img2img-style audio-to-audio), the encoder
	latent is re-noised and refined by the diffusion sampler, so the int4 precision loss is not
	audible in practice.
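
	For readers who want to run a similar comparison on their own clips, here is a minimal sketch of the two metrics named above. The exact definitions behind the reported numbers are not given here, so this assumes the common ones: SNR as signal energy over error energy in dB, and Pearson correlation over the flattened latents. Both function names are hypothetical.

	```javascript
	// Reconstruction SNR in dB: 10 * log10(sum(ref^2) / sum((ref - est)^2)).
	function snrDb(reference, estimate) {
	  if (reference.length !== estimate.length) throw new Error("length mismatch");
	  let signal = 0;
	  let noise = 0;
	  for (let i = 0; i < reference.length; i++) {
	    const err = reference[i] - estimate[i];
	    signal += reference[i] * reference[i];
	    noise += err * err;
	  }
	  return 10 * Math.log10(signal / noise);
	}

	// Pearson correlation between two equally sized flat arrays.
	function pearson(a, b) {
	  if (a.length !== b.length) throw new Error("length mismatch");
	  const n = a.length;
	  let meanA = 0;
	  let meanB = 0;
	  for (let i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
	  meanA /= n;
	  meanB /= n;
	  let cov = 0;
	  let varA = 0;
	  let varB = 0;
	  for (let i = 0; i < n; i++) {
	    const da = a[i] - meanA;
	    const db = b[i] - meanB;
	    cov += da * db;
	    varA += da * da;
	    varB += db * db;
	  }
	  return cov / Math.sqrt(varA * varB);
	}
	```

	Time-domain SNR is sensitive to small phase or alignment shifts, so a low dB figure does not by itself mean a reconstruction sounds poor.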

	## License

	Inherits the Stability AI Community License from the upstream weights. This repo contains
	only the audio autoencoder's encoder; the T5Gemma text encoder bundled elsewhere in the
	pipeline additionally falls under Google's Gemma Terms of Use.