| --- |
| language: en |
| license: mit |
| library_name: onnxruntime |
| pipeline_tag: audio-to-audio |
| tags: |
| - onnx |
| - onnxruntime |
| - stem-separation |
| - source-separation |
| - vocal-remover |
| - karaoke |
| - acapella |
| - demucs |
| - htdemucs |
| - music |
| - audio-to-audio |
| - mobile |
| - ios |
| - android |
| - coreml |
| - directml |
| - production-ready |
| datasets: |
| - StemSplitio/stem-separation-benchmark-2026 |
| inference: false |
| --- |
| |
| # HT-Demucs (single-file 4-stem) - ONNX |
|
|
| The **first ONNX export of the standard `htdemucs` (non-FT) model** on |
| the Hugging Face Hub. Runs in `onnxruntime` on CPU out of the box, and |
| on CoreML / CUDA / DirectML with a one-line provider change. |
| **No PyTorch required at inference.** |
|
|
| This repo is the single-file companion to |
| [`StemSplitio/htdemucs-ft-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-onnx). |
| You get all 4 stems out of one 316 MB `.onnx` file (`htdemucs.onnx`), |
| or 166 MB if you grab the fp16weights variant. The FT bag is higher |
| quality; this single model is ~30% faster and uses 1 session instead of 4. |
|
|
| --- |
|
|
| ## TL;DR |
|
|
| ```bash |
| # 316 MB fp32 model: |
| pip install onnxruntime numpy soundfile |
| python infer.py your-song.mp3 ./out/ --write-all-stems |
| # writes ./out/{drums,bass,other,vocals}.wav at 44.1 kHz stereo |
| |
| # 166 MB fp16weights variant (same runtime cost): |
| python infer.py your-song.mp3 ./out/ --small --write-all-stems |
| ``` |
|
|
| The repo contains: |
|
|
| - `htdemucs.onnx`: 316 MB, opset 17, parity-verified vs PyTorch fp32.
| - `htdemucs_fp16weights.onnx`: 166 MB, fp16-stored weights, same runtime memory / latency.
| - `infer.py`: pure-numpy reference inference (~200 lines, no torch).
| - `requirements.txt`: three small packages, no PyTorch.
|
|
| --- |
|
|
| ## Quality |
|
|
| The official `htdemucs` model is the precursor to `htdemucs_ft`: the same
| architecture, with a single set of weights instead of 4 specialist sub-models.
| On MUSDB18-HQ: |
|
|
| | Metric | `htdemucs` (this) | `htdemucs_ft` (4-bag) | |
| |---|---:|---:| |
| | Median vocals SDR | ~8.8 dB | **9.19 dB** | |
| | Median drums SDR | ~9.5 dB | **10.11 dB** | |
| | Total model size | **316 MB** | 1.26 GB | |
| | Sessions to load | **1** | 4 | |
| | Speed vs the bag | **~1.4× faster** | baseline | |
|
|
| Parity vs PyTorch fp32 (random input, 7.8 s segment): |
|
|
| - `htdemucs.onnx` max abs diff: **6.62 × 10⁻⁴**
| - `htdemucs_fp16weights.onnx` max abs diff (vs fp32 weights): **4.6 × 10⁻⁵**
|
|
| Both are well within the 1e-3 publish threshold.
|
|
| --- |
|
|
| ## Performance |
|
|
| Single 7.8 s segment, Apple M4 Pro CPU: |
|
|
| | Variant | RAM | Latency | RTF | |
| |---|---:|---:|---:| |
| | `htdemucs.onnx` (fp32) | ~1.1 GB | ~1.6 s | 0.20 | |
| | `htdemucs_fp16weights.onnx` | ~1.1 GB | ~1.6 s | 0.20 | |
| | For comparison: `htdemucs_ft` (4-session bag) | ~4.0 GB | ~6.4 s | 0.49 | |
|
|
| CUDA / DirectML / CoreML EPs are typically ≥ 5× faster on real GPUs.
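
The RTF column can be reproduced with a little arithmetic. The sketch below assumes the usual definition of real-time factor (processing time divided by audio duration); the table itself does not state how RTF was measured, and the function name is illustrative.

```python
SAMPLE_RATE = 44_100        # Hz, from the input spec
SEGMENT_SAMPLES = 343_980   # samples per segment, from the input spec


def real_time_factor(latency_s: float,
                     n_samples: int = SEGMENT_SAMPLES,
                     sample_rate: int = SAMPLE_RATE) -> float:
    """Processing time divided by audio duration (assumed definition).

    Values below 1.0 mean the model runs faster than real time.
    """
    audio_s = n_samples / sample_rate
    return latency_s / audio_s


segment_s = SEGMENT_SAMPLES / SAMPLE_RATE   # length of one segment in seconds
rtf = real_time_factor(1.6)                 # ~1.6 s latency from the fp32 row
print(f"segment = {segment_s:.1f} s, RTF = {rtf:.2f}")
```

Because the latency figures are approximate, the computed value only matches the table to roughly one significant digit.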
|
|
| --- |
|
|
| ## Quick start |
|
|
| ### Python |
|
|
| ```python |
| import soundfile as sf |
| import infer |
| |
| audio, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True) |
| stems = infer.separate(audio.T, sr, |
| model_path=infer.DEFAULT_MODEL, |
| providers=["CPUExecutionProvider"]) |
| for stem, arr in stems.items(): |
| sf.write(f"{stem}.wav", arr.T, sr) |
| ``` |
|
|
| ### CLI |
|
|
| ```bash |
| python infer.py your-song.mp3 ./out/ --write-all-stems |
| python infer.py your-song.mp3 ./out/ --providers coreml # macOS arm64 |
| python infer.py your-song.mp3 ./out/ --providers cuda # Linux + NVIDIA |
| python infer.py your-song.mp3 ./out/ --providers dml # Windows + DX12 |
| python infer.py your-song.mp3 ./out/ --small # 166 MB variant |
| ``` |
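
The short `--providers` names above correspond to ONNX Runtime execution providers. The sketch below shows one way such a mapping with a CPU fallback could look. The provider strings are ONNX Runtime's own names, but `PROVIDER_ALIASES` and `resolve_providers` are hypothetical helpers, not code taken from `infer.py`.

```python
# Short CLI-style alias -> ONNX Runtime execution provider name.
# Assumption: this mapping is illustrative and may differ from infer.py.
PROVIDER_ALIASES = {
    "cpu": "CPUExecutionProvider",
    "cuda": "CUDAExecutionProvider",
    "coreml": "CoreMLExecutionProvider",
    "dml": "DmlExecutionProvider",
}


def resolve_providers(alias: str, available: list[str]) -> list[str]:
    """Return the requested provider (if available) followed by a CPU fallback."""
    try:
        wanted = PROVIDER_ALIASES[alias.lower()]
    except KeyError:
        raise ValueError(f"unknown provider alias: {alias!r}") from None
    providers = [wanted] if wanted in available else []
    if "CPUExecutionProvider" not in providers:
        providers.append("CPUExecutionProvider")
    return providers


# Example with a hard-coded list of installed providers.
print(resolve_providers("cuda", ["CUDAExecutionProvider", "CPUExecutionProvider"]))
```

In a real script, `available` would come from `onnxruntime.get_available_providers()`, and the returned list would be passed as `providers=` to `onnxruntime.InferenceSession`.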
|
|
| ### Mobile / Web (after adding ONNX Runtime for iOS, e.g. the `onnxruntime-objc` pod, or the `onnxruntime-web` npm package)
|
|
| ```swift |
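| // iOS / Swift
| import onnxruntime_objc
| // Create the runtime environment once and reuse it across sessions.
| let env = try ORTEnv(loggingLevel: ORTLoggingLevel.warning)
| let opts = try ORTSessionOptions()
| // Route supported ops to CoreML; the rest fall back to CPU.
| try opts.appendCoreMLExecutionProvider(with: ORTCoreMLExecutionProviderOptions())
| let session = try ORTSession(env: env,
|     modelPath: Bundle.main.path(forResource: "htdemucs", ofType: "onnx")!,
|     sessionOptions: opts)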
| ``` |
|
|
| ```js |
| // Browser / web |
| import * as ort from "onnxruntime-web"; |
| const sess = await ort.InferenceSession.create("htdemucs_fp16weights.onnx", { |
| executionProviders: ["wasm"], |
| }); |
| // audioBuffer: Float32Array of 2 * 343980 samples, channel-planar (left, then right).
| const t = new ort.Tensor("float32", audioBuffer, [1, 2, 343980]);
| const out = await sess.run({ mix: t }); // out.stems has dims [1, 4, 2, 343980]
| ``` |
|
|
| For a turnkey browser demo with file-picker + chunked overlap-add, see |
| [`demucs-onnx browser-demo`](https://github.com/StemSplit/demucs-onnx#browser-demos). |
|
|
| --- |
|
|
| ## Input / output spec |
|
|
| | Tensor | Name | Shape | Dtype | Notes | |
| |---|---|---|---|---| |
| | Input | `mix` | `(1, 2, 343980)` | float32 | Stereo, 44.1 kHz, 7.8 s segment. Values in [-1, 1]. | |
| | Output | `stems` | `(1, 4, 2, 343980)` | float32 | Stems in order `[drums, bass, other, vocals]`. All 4 are real predictions (unlike the FT specialists). | |
|
|
| For longer audio, chunk with overlap-add; see `infer.py::separate` for
| a working 60-line implementation.
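
As a rough illustration of the idea (not the code in `infer.py`), the sketch below plans fixed-size chunks and blends their outputs with a triangular window. The 25% overlap, the window shape, and the helper names `plan_chunks` and `overlap_add` are assumptions. It works on a mono list of floats with a stand-in `process` callable, whereas a real implementation would operate on NumPy arrays and call the ONNX session.

```python
SEGMENT = 343_980  # model input length in samples (7.8 s at 44.1 kHz)


def plan_chunks(n_samples, segment=SEGMENT, overlap=0.25):
    """Return (starts, padded_length) for fixed-size, overlapping chunks."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    stride = max(1, int(segment * (1.0 - overlap)))
    starts = [0]
    while starts[-1] + segment < n_samples:
        starts.append(starts[-1] + stride)
    return starts, starts[-1] + segment


def overlap_add(signal, process, segment=SEGMENT, overlap=0.25):
    """Run `process` on each chunk and blend results with a triangular window."""
    n = len(signal)
    starts, padded = plan_chunks(n, segment, overlap)
    x = list(signal) + [0.0] * (padded - n)  # zero-pad so the last chunk is full
    # Strictly positive triangular weights, so every sample gets non-zero weight.
    window = [min(i + 1, segment - i) for i in range(segment)]
    acc = [0.0] * padded
    weight = [0.0] * padded
    for s in starts:
        y = process(x[s:s + segment])
        for i, (value, w) in enumerate(zip(y, window)):
            acc[s + i] += value * w
            weight[s + i] += w
    # Normalise by the summed weights and drop the padding.
    return [a / w for a, w in zip(acc[:n], weight[:n])]


# Toy check: an identity "model" should reproduce the input.
toy = [float(i) for i in range(23)]
restored = overlap_add(toy, lambda chunk: chunk, segment=8)
```

Weighting each chunk toward its centre reduces audible seams at chunk boundaries, since the model's predictions tend to be least reliable near the edges of a segment.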
|
|
| --- |
|
|
| ## Tooling: `demucs-onnx` Python package
|
|
| This model can be run (and re-exported from PyTorch) via the open-source |
| [`demucs-onnx`](https://github.com/StemSplit/demucs-onnx) Python package |
| on PyPI. It auto-downloads from this repo on first use, so you don't |
| have to clone or wrangle file paths. |
|
|
| ```bash |
| pip install demucs-onnx |
| |
| # Single-file 4-stem flavor (this repo): |
| demucs-onnx separate song.mp3 stems/ --model htdemucs |
| |
| # Python API: |
| python -c "from demucs_onnx import separate; \ |
| print(separate('song.mp3', model='htdemucs').keys())" |
| ``` |
|
|
| To re-export your own fine-tune: |
|
|
| ```bash |
| pip install 'demucs-onnx[export]' |
| demucs-onnx export htdemucs out/htdemucs.onnx |
| ``` |
|
|
| --- |
|
|
| ## How it was built |
|
|
| The export pipeline lives in the open-source |
| [`demucs-onnx`](https://github.com/StemSplit/demucs-onnx) package at |
| [`demucs_onnx/export/`](https://github.com/StemSplit/demucs-onnx/tree/main/src/demucs_onnx/export). |
| It applies four patches to make `torch.onnx.export` work on htdemucs: |
|
|
| 1. Complex-typed `torch.stft` outputs → `Conv1d` with sin/cos kernels.
| 2. `model.segment` `fractions.Fraction` → plain `float`.
| 3. `random.randrange` in transformer pos-embedding → hardcoded `shift=0`.
| 4. `aten::_native_multi_head_attention` (no ONNX symbolic) → drop-in
|    `nn.MultiheadAttention.forward` built from `Linear`/`bmm`/`softmax`.
|
|
| These are the four blockers every previous community attempt at "demucs |
| onnx" stalled on. See the [README of the demucs-onnx package](https://github.com/StemSplit/demucs-onnx#the-4-blockers-explained) |
| for the full write-up with code references. |
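
To make the first patch more concrete, here is a minimal pure-Python sketch of why a strided 1-D convolution with cosine and sine kernels yields the real and imaginary parts of a short-time Fourier transform. It is an illustration only, not the package's export code. It uses a rectangular window and no centre padding, and all function names are hypothetical.

```python
import cmath
import math


def dft_kernels(n_fft):
    """Cos and -sin kernels, one row per frequency bin (n_fft // 2 + 1 rows)."""
    bins = n_fft // 2 + 1
    cos_k = [[math.cos(2 * math.pi * k * n / n_fft) for n in range(n_fft)]
             for k in range(bins)]
    sin_k = [[-math.sin(2 * math.pi * k * n / n_fft) for n in range(n_fft)]
             for k in range(bins)]
    return cos_k, sin_k


def conv1d(signal, kernels, stride):
    """Unpadded strided cross-correlation, as deep-learning Conv1d layers compute it."""
    width = len(kernels[0])
    n_frames = (len(signal) - width) // stride + 1
    return [[sum(w * signal[t * stride + n] for n, w in enumerate(kernel))
             for t in range(n_frames)]
            for kernel in kernels]


def stft_real(signal, n_fft, hop):
    """Return (real, imag), each indexed as [bin][frame], using real arithmetic only."""
    cos_k, sin_k = dft_kernels(n_fft)
    return conv1d(signal, cos_k, hop), conv1d(signal, sin_k, hop)


def reference_frame_dft(frame):
    """Complex DFT of one frame, used only to check the real-valued version."""
    n_fft = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n, x in enumerate(frame))
            for k in range(n_fft // 2 + 1)]


sig = [math.sin(0.3 * i) + 0.5 * math.cos(1.1 * i) for i in range(32)]
re, im = stft_real(sig, n_fft=8, hop=4)
```

A window function such as Hann can be folded into the kernels by multiplying each kernel row elementwise by the window, which keeps the whole transform a single real-valued convolution that ONNX can represent.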
|
|
| --- |
|
|
| ## Related work |
|
|
| Sibling ONNX repos from the same export pipeline: |
|
|
| | Repo | Format | Stems | Use when | |
| |---|---|---|---| |
| | `htdemucs-onnx` *(this)* | Single file | 4 | Faster startup, fewer sessions, ~30% lower latency than the FT bag. | |
| | [`htdemucs-ft-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-onnx) | Bag of 4 files | 4 | Best SDR, especially on vocals. The default in StemSplit production. | |
| | [`htdemucs-6s-onnx`](https://huggingface.co/StemSplitio/htdemucs-6s-onnx) | Single file | 6 | Need guitar + piano stems on top of the standard 4. | |
| | [`htdemucs-ft-{drums,bass,other,vocals}-onnx`](https://huggingface.co/StemSplitio) | Single specialist | 1 | Fastest single-stem inference; 4× faster than the bag. |
|
|
| Full benchmark across every popular open-source separator: |
| [StemSplitio/stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026). |
|
|
| --- |
|
|
| ## Skip the infrastructure: use the StemSplit API
|
|
| Don't want to bundle a 316 MB model in your app, manage a GPU pool, or |
| write overlap-add chunking? Use the **[StemSplit API](https://stemsplit.io/developers)** |
| instead. It runs the same model under the hood, hosted for you, with credits and a
| dashboard. |
|
|
| - [stemsplit.io](https://stemsplit.io)
| - [Developer docs](https://stemsplit.io/developers/docs)
| - [API reference](https://stemsplit.io/developers/reference)
|
|
| Or use the no-code tools that ship the same model family: |
|
|
| - [Vocal Remover](https://stemsplit.io/vocal-remover)
| - [Karaoke Maker](https://stemsplit.io/karaoke-maker)
| - [Acapella Maker](https://stemsplit.io/acapella-maker)
| - [YouTube Stem Splitter](https://stemsplit.io/youtube-stem-splitter)
|
|
| --- |
|
|
| ## License & attribution |
|
|
| This repo is **MIT-licensed**, matching the original HT-Demucs. |
|
|
| ```bibtex |
| @inproceedings{rouard2023hybrid, |
| title = {Hybrid Transformers for Music Source Separation}, |
| author = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre}, |
| booktitle = {ICASSP}, |
| year = {2023} |
| } |
| ``` |
|
|
| - Original PyTorch model: [`facebookresearch/demucs`](https://github.com/facebookresearch/demucs) |
| - ONNX export, parity verification, and packaging by [StemSplit](https://stemsplit.io) |
| - Search keywords: **htdemucs onnx**, **demucs onnx single file**, **demucs ios**, |
| **demucs android**, **music source separation onnx**, **stem separation mobile**. |
|
|