
	---
	language: en
	license: mit
	library_name: onnxruntime
	pipeline_tag: audio-to-audio
	tags:
	- onnx
	- onnxruntime
	- stem-separation
	- source-separation
	- vocal-remover
	- karaoke
	- acapella
	- demucs
	- htdemucs
	- music
	- audio-to-audio
	- mobile
	- ios
	- android
	- coreml
	- directml
	- production-ready
	datasets:
	- StemSplitio/stem-separation-benchmark-2026
	inference: false
	---

	# HT-Demucs (single-file 4-stem) — ONNX

	The first ONNX export of the standard `htdemucs` (non-FT) model on
	the Hugging Face Hub. Runs in `onnxruntime` on CPU out of the box, and
	on CoreML / CUDA / DirectML with a one-line provider change.
	No PyTorch required at inference.

	This repo is the single-file companion to
	[`StemSplitio/htdemucs-ft-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-onnx).
	You get all 4 stems from one 316 MB `.onnx` file (`htdemucs.onnx`),
	or from a 166 MB file if you use the `fp16weights` variant. The FT bag is
	higher quality; this single model is ~30% faster and uses 1 session instead of 4.

	---

	## TL;DR

	```bash
	# 316 MB fp32 model:
	pip install onnxruntime numpy soundfile
	python infer.py your-song.mp3 ./out/ --write-all-stems
	# writes ./out/{drums,bass,other,vocals}.wav at 44.1 kHz stereo

	# 166 MB fp16weights variant (same runtime cost):
	python infer.py your-song.mp3 ./out/ --small --write-all-stems
	```

	The repo contains:

	- `htdemucs.onnx` — 316 MB, opset 17, parity-verified vs PyTorch fp32.
	- `htdemucs_fp16weights.onnx` — 166 MB, fp16-stored weights, same runtime memory / latency.
	- `infer.py` — pure-numpy reference inference (~200 lines, no torch).
	- `requirements.txt` — three small packages, no PyTorch.
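
As a rough illustration of why the `fp16weights` file is about half the size while runtime memory and latency stay the same, the sketch below stores stand-in weights as float16 and casts them back to float32 before use. This is an assumption about what "fp16-stored weights" means here; the actual graph may handle the cast differently, and `fp16_storage_roundtrip` is a hypothetical helper, not part of this repo.

```python
import numpy as np

def fp16_storage_roundtrip(weights_fp32):
    """Store weights as float16, then cast back to float32 as a runtime might."""
    stored = weights_fp32.astype(np.float16)   # on disk: 2 bytes per value
    restored = stored.astype(np.float32)       # at compute time: 4 bytes per value
    max_err = float(np.max(np.abs(restored - weights_fp32)))
    return stored.nbytes, restored.nbytes, max_err

rng = np.random.default_rng(0)
# Stand-in weights in [-1, 1); real model weights are not used here.
weights = rng.uniform(-1.0, 1.0, size=100_000).astype(np.float32)
disk_bytes, runtime_bytes, max_err = fp16_storage_roundtrip(weights)
print(disk_bytes / runtime_bytes)  # 0.5
print(max_err)                     # rounding error introduced by fp16 storage
```

Only the stored bytes shrink; the restored array is float32 again, which is consistent with the unchanged RAM and latency figures.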

	---

	## Quality

	The official `htdemucs` model is the precursor to `htdemucs_ft` — same
	architecture, single set of weights instead of 4 specialist sub-models.
	On MUSDB18-HQ:

	| Metric | `htdemucs` (this) | `htdemucs_ft` (4-bag) |
	|---|---:|---:|
	| Median vocals SDR | ~8.8 dB | 9.19 dB |
	| Median drums SDR | ~9.5 dB | 10.11 dB |
	| Total model size | 316 MB | 1.26 GB |
	| Sessions to load | 1 | 4 |
	| Speed vs the bag | ~1.4× faster | baseline |

	Parity vs PyTorch fp32 (random input, 7.8 s segment):

	- `htdemucs.onnx` max abs diff: 6.62 × 10⁻⁴
	- `htdemucs_fp16weights.onnx` max abs diff (vs fp32 weights): 4.6 × 10⁻⁵

	Both are well within the 1e-3 threshold required for publishing.
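
A parity check of this kind can be sketched in a few lines of numpy. This is a minimal sketch, not the script used for the numbers above: `max_abs_diff` and `passes_parity` are hypothetical helpers, and the arrays stand in for the PyTorch and ONNX outputs on the same input.

```python
import numpy as np

PUBLISH_THRESHOLD = 1e-3  # threshold quoted above

def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two same-shaped outputs."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    return float(np.max(np.abs(reference - candidate)))

def passes_parity(reference, candidate, threshold=PUBLISH_THRESHOLD):
    """True when the candidate stays below the threshold everywhere."""
    return max_abs_diff(reference, candidate) < threshold

# Stand-in outputs with the (1, 4, 2, n) stem layout; real model outputs are not used.
torch_out = np.zeros((1, 4, 2, 8), dtype=np.float32)
onnx_out = torch_out + np.float32(6.62e-4)
print(passes_parity(torch_out, onnx_out))
```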

	---

	## Performance

	Single 7.8 s segment, Apple M4 Pro CPU:

	| Variant | RAM | Latency | RTF |
	|---|---:|---:|---:|
	| `htdemucs.onnx` (fp32) | ~1.1 GB | ~1.6 s | 0.20 |
	| `htdemucs_fp16weights.onnx` | ~1.1 GB | ~1.6 s | 0.20 |
	| For comparison: `htdemucs_ft` (4-session bag) | ~4.0 GB | ~6.4 s | 0.49 |

	The CUDA, DirectML, and CoreML execution providers (EPs) are typically ≥ 5× faster than this CPU baseline on real GPUs.

	---

	## Quick start

	### Python

	```python
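	import soundfile as sf
	import infer  # infer.py from this repo, on the import path

	# soundfile returns (samples, channels); separate() is given (channels, samples).
	audio, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)
	stems = infer.separate(
	    audio.T, sr,
	    model_path=infer.DEFAULT_MODEL,
	    providers=["CPUExecutionProvider"],
	)
	# Write each stem back out as (samples, channels).
	for stem, arr in stems.items():
	    sf.write(f"{stem}.wav", arr.T, sr)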
	```

	### CLI

	```bash
	python infer.py your-song.mp3 ./out/ --write-all-stems
	python infer.py your-song.mp3 ./out/ --providers coreml # macOS arm64
	python infer.py your-song.mp3 ./out/ --providers cuda # Linux + NVIDIA
	python infer.py your-song.mp3 ./out/ --providers dml # Windows + DX12
	python infer.py your-song.mp3 ./out/ --small # 166 MB variant
	```
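
For readers calling `onnxruntime` directly instead of the CLI, the short names above correspond to execution provider strings. The sketch below shows one way to resolve a short name with a CPU fallback. The alias table and `resolve_providers` are hypothetical and not taken from `infer.py`; the provider strings themselves are the standard `onnxruntime` names. Note that CUDA and DirectML normally require the `onnxruntime-gpu` and `onnxruntime-directml` packages rather than plain `onnxruntime`.

```python
# Hypothetical short aliases mapped to onnxruntime execution provider names.
PROVIDER_ALIASES = {
    "cpu": "CPUExecutionProvider",
    "coreml": "CoreMLExecutionProvider",
    "cuda": "CUDAExecutionProvider",
    "dml": "DmlExecutionProvider",
}

def resolve_providers(alias, available):
    """Return a provider list for InferenceSession, always ending in a CPU fallback.

    `available` is the list of providers the installed runtime supports,
    e.g. the result of onnxruntime.get_available_providers().
    """
    name = PROVIDER_ALIASES[alias.lower()]
    providers = [name] if name in available else []
    if "CPUExecutionProvider" not in providers:
        providers.append("CPUExecutionProvider")
    return providers

# Example with a hard-coded availability list (no onnxruntime import needed here).
print(resolve_providers("cuda", ["CUDAExecutionProvider", "CPUExecutionProvider"]))
```

The resulting list can be passed as `providers=` to `onnxruntime.InferenceSession`; keeping CPU last means the session still loads when the requested accelerator is missing.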

	### Mobile / Web (after adding the `onnxruntime-objc` CocoaPod or running `npm install onnxruntime-web`)

	```swift
	// iOS / Swift
	import Foundation
	import onnxruntime_objc

	let env = try ORTEnv(loggingLevel: ORTLoggingLevel.warning)
	let opts = try ORTSessionOptions()
	try opts.appendCoreMLExecutionProvider(with: ORTCoreMLExecutionProviderOptions())
	let session = try ORTSession(
	    env: env,
	    modelPath: Bundle.main.path(forResource: "htdemucs", ofType: "onnx")!,
	    sessionOptions: opts)
	```

	```js
	// Browser / web
	import * as ort from "onnxruntime-web";

	const sess = await ort.InferenceSession.create("htdemucs_fp16weights.onnx", {
	  executionProviders: ["wasm"],
	});
	// audioBuffer: Float32Array of length 2 * 343980 (left channel, then right)
	const t = new ort.Tensor("float32", audioBuffer, [1, 2, 343980]);
	const out = await sess.run({ mix: t }); // out.stems has dims [1, 4, 2, 343980]
	```

	For a turnkey browser demo with file-picker + chunked overlap-add, see
	[`demucs-onnx browser-demo`](https://github.com/StemSplit/demucs-onnx#browser-demos).

	---

	## Input / output spec

	| Tensor | Name | Shape | Dtype | Notes |
	|---|---|---|---|---|
	| Input | `mix` | `(1, 2, 343980)` | float32 | Stereo, 44.1 kHz, 7.8 s segment. Values in [-1, 1]. |
	| Output | `stems` | `(1, 4, 2, 343980)` | float32 | Stems in order `[drums, bass, other, vocals]`. All 4 are real predictions (unlike the FT specialists). |

	For longer audio, chunk with overlap-add — see `infer.py::separate` for
	a working 60-line implementation.
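
The general shape of overlap-add chunking can be sketched as follows. This is a simplified illustration and not the code in `infer.py`: the 25% overlap, the triangular window, and the `run_segment` callable (which stands in for a call to the ONNX session) are all assumptions.

```python
import numpy as np

SEGMENT = 343980  # samples per model call (7.8 s at 44.1 kHz), from the table above
N_STEMS = 4

def separate_long(mix, run_segment, overlap=0.25):
    """Overlap-add separation of a (2, n_samples) float32 mix.

    run_segment maps a (1, 2, SEGMENT) array to a (1, 4, 2, SEGMENT) array.
    Returns an array of shape (4, 2, n_samples).
    """
    n = mix.shape[1]
    hop = int(SEGMENT * (1 - overlap))
    # Triangular window so overlapping chunks cross-fade into each other.
    ramp = np.arange(1, SEGMENT // 2 + 1, dtype=np.float32)
    window = np.concatenate([ramp, ramp[::-1]])
    window /= window.max()
    out = np.zeros((N_STEMS, 2, n), dtype=np.float32)
    weight = np.zeros(n, dtype=np.float32)
    for start in range(0, n, hop):
        chunk = mix[:, start:start + SEGMENT]
        valid = chunk.shape[1]
        # Zero-pad the last chunk up to the fixed segment length.
        padded = np.zeros((1, 2, SEGMENT), dtype=np.float32)
        padded[0, :, :valid] = chunk
        stems = run_segment(padded)[0]  # (4, 2, SEGMENT)
        out[:, :, start:start + valid] += stems[:, :, :valid] * window[:valid]
        weight[start:start + valid] += window[:valid]
    # Normalize by the accumulated window weight at every sample.
    return out / weight

# Stand-in "model" that copies the mix into every stem, used to check the bookkeeping.
def identity_model(x):
    return np.repeat(x[:, None, :, :], N_STEMS, axis=1)

rng = np.random.default_rng(0)
mix = rng.uniform(-1, 1, size=(2, 2 * SEGMENT + 1234)).astype(np.float32)
stems = separate_long(mix, identity_model)
print(stems.shape)
```

With a real session, `run_segment` could be something like `lambda x: session.run(None, {"mix": x})[0]`, using the input name from the table above.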

	---

	## Tooling — `demucs-onnx` Python package

	This model can be run (and re-exported from PyTorch) via the open-source
	[`demucs-onnx`](https://github.com/StemSplit/demucs-onnx) Python package
	on PyPI. It auto-downloads from this repo on first use, so you don't
	have to clone or wrangle file paths.

	```bash
	pip install demucs-onnx

	# Single-file 4-stem flavor (this repo):
	demucs-onnx separate song.mp3 stems/ --model htdemucs

	# Python API:
	python -c "from demucs_onnx import separate; \
	print(separate('song.mp3', model='htdemucs').keys())"
	```

	To re-export the model yourself (the starting point for exporting your own fine-tune):

	```bash
	pip install 'demucs-onnx[export]'
	demucs-onnx export htdemucs out/htdemucs.onnx
	```

	---

	## How it was built

	The export pipeline lives in the open-source
	[`demucs-onnx`](https://github.com/StemSplit/demucs-onnx) package at
	[`demucs_onnx/export/`](https://github.com/StemSplit/demucs-onnx/tree/main/src/demucs_onnx/export).
	It applies four patches to make `torch.onnx.export` work on htdemucs:

	1. Complex-typed `torch.stft` outputs → `Conv1d` with sin/cos kernels.
	2. `model.segment` `fractions.Fraction` → plain `float`.
	3. `random.randrange` in transformer pos-embedding → hardcoded `shift=0`.
	4. `aten::_native_multi_head_attention` (no ONNX symbolic) → a drop-in replacement for
	   `nn.MultiheadAttention.forward` built from `Linear`/`bmm`/`softmax`.

	These are the four blockers every previous community attempt at "demucs
	onnx" stalled on. See the [README of the demucs-onnx package](https://github.com/StemSplit/demucs-onnx#the-4-blockers-explained)
	for the full write-up with code references.
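
To give a feel for patch 1, the sketch below computes an STFT using only real arithmetic: each frequency bin is a dot product of a frame with a windowed cosine kernel (real part) and a windowed negative sine kernel (imaginary part). A strided `Conv1d` whose weights are these kernels computes the same dot products. This is a numpy illustration under assumed parameters (`n_fft=16`, `hop=4`, periodic Hann window, no padding), not the export code; the real pipeline's framing and normalization may differ.

```python
import numpy as np

def real_stft(signal, n_fft=16, hop=4):
    """Return (real, imag) STFT parts, each shaped (n_frames, n_fft // 2 + 1)."""
    k = np.arange(n_fft // 2 + 1)[:, None]   # frequency bin index
    n = np.arange(n_fft)[None, :]            # sample index within a frame
    angle = 2.0 * np.pi * k * n / n_fft
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(n_fft) / n_fft)
    cos_kernel = np.cos(angle) * window      # one row per bin, like conv output channels
    sin_kernel = -np.sin(angle) * window
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slicing frames at stride `hop` mirrors what a strided convolution does.
    frames = np.stack([signal[i * hop:i * hop + n_fft] for i in range(n_frames)])
    return frames @ cos_kernel.T, frames @ sin_kernel.T

rng = np.random.default_rng(0)
sig = rng.standard_normal(64)
re, im = real_stft(sig)

# Reference: complex FFT of the same windowed frames.
win = 0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(16) / 16)
ref_frames = np.stack([sig[i * 4:i * 4 + 16] for i in range(re.shape[0])])
ref = np.fft.rfft(ref_frames * win, axis=1)
print(np.allclose(re, ref.real), np.allclose(im, ref.imag))
```

Because the output is two real tensors instead of one complex tensor, the graph never needs a complex dtype.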

	---

	## Related work

	Sibling ONNX repos from the same export pipeline:

	| Repo | Format | Stems | Use when |
	|---|---|---|---|
	| `htdemucs-onnx` (this) | Single file | 4 | Faster startup, fewer sessions, ~30% lower latency than the FT bag. |
	| [`htdemucs-ft-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-onnx) | Bag of 4 files | 4 | Best SDR, especially on vocals. The default in StemSplit production. |
	| [`htdemucs-6s-onnx`](https://huggingface.co/StemSplitio/htdemucs-6s-onnx) | Single file | 6 | Need guitar + piano stems on top of the standard 4. |
	| [`htdemucs-ft-{drums,bass,other,vocals}-onnx`](https://huggingface.co/StemSplitio) | Single specialist | 1 | Fastest single-stem inference; 4× faster than the bag. |

	Full benchmark across every popular open-source separator:
	[StemSplitio/stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026).

	---

	## Skip the infrastructure — use the StemSplit API

	Don't want to bundle a 316 MB model in your app, manage a GPU pool, or
	write overlap-add chunking? Use the [StemSplit API](https://stemsplit.io/developers)
	instead — same model under the hood, hosted for you, with credits and a
	dashboard.

	- 🌐 [stemsplit.io](https://stemsplit.io)
	- 📘 [Developer docs](https://stemsplit.io/developers/docs)
	- 🔌 [API reference](https://stemsplit.io/developers/reference)

	Or use the no-code tools that ship the same model family:

	- 🎤 [Vocal Remover](https://stemsplit.io/vocal-remover)
	- 🎶 [Karaoke Maker](https://stemsplit.io/karaoke-maker)
	- 🎙️ [Acapella Maker](https://stemsplit.io/acapella-maker)
	- 📺 [YouTube Stem Splitter](https://stemsplit.io/youtube-stem-splitter)

	---

	## License & attribution

	This repo is MIT-licensed, matching the original HT-Demucs.

	```bibtex
	@inproceedings{rouard2023hybrid,
	title = {Hybrid Transformers for Music Source Separation},
	author = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
	booktitle = {ICASSP},
	year = {2023}
	}
	```

	- Original PyTorch model: [`facebookresearch/demucs`](https://github.com/facebookresearch/demucs)
	- ONNX export, parity verification, and packaging by [StemSplit](https://stemsplit.io)
	- Search keywords: htdemucs onnx, demucs onnx single file, demucs ios,
	demucs android, music source separation onnx, stem separation mobile.