Initial release: complete 4-stem htdemucs_ft ONNX bag (drums/bass/other/vocals) + numpy aggregator

Files changed (3)

README.md +257 -0
bag_infer.py +194 -0
requirements.txt +3 -0

README.md ADDED

	@@ -0,0 +1,257 @@

+---
+language: en
+license: mit
+library_name: onnxruntime
+pipeline_tag: audio-to-audio
+tags:
+  - onnx
+  - onnxruntime
+  - stem-separation
+  - source-separation
+  - vocal-isolation
+  - vocal-remover
+  - drum-extraction
+  - bass-extraction
+  - karaoke
+  - demucs
+  - htdemucs
+  - music
+  - audio-to-audio
+  - mobile
+  - ios
+  - android
+  - coreml
+  - directml
+  - production-ready
+datasets:
+  - StemSplitio/stem-separation-benchmark-2026
+inference: false
+---
+# HT-Demucs FT — Full 4-Stem Bag, ONNX
+**The first complete ONNX export of HT-Demucs FT on the Hugging Face Hub.**
+Four parity-verified ONNX models (drums, bass, other, vocals) plus a
+~200-line numpy aggregator (`bag_infer.py`) that runs the full 4-stem
+separation in pure `onnxruntime`. **No PyTorch required at inference.**
+Runs on CPU / CoreML / CUDA / DirectML.
+This repo is the convenience drop — all 4 specialist sub-models of
+`htdemucs_ft` in one place, with a working bag-inference script. If you
+only need one stem in production, the individual stem-specialist repos
+below are ~75% smaller and ~4× faster per song.
+---
+## TL;DR
+```bash
+pip install onnxruntime numpy soundfile
+python bag_infer.py your-song.mp3 ./out/
+# writes out/drums.wav, out/bass.wav, out/other.wav, out/vocals.wav
+```
+That's it. The 4 `.onnx` files (316 MB each, ~1.26 GB total) must sit in
+the same directory as `bag_infer.py`; that is where the script looks by default.
+---
+## Quality
+Median per-stem SDR on the MUSDB18-HQ test split (50 songs), BSS Eval v4
+via `museval`. **Identical to the official PyTorch `htdemucs_ft`** — the
+bag's per-stem output IS the corresponding specialist's output (the weight
+matrix is one-hot per stem).
+| Stem | SDR (dB) | Rank in our 2026 benchmark |
+|---|---:|---|
+| **vocals** | **9.19** | **#1** (highest open-source vocal SDR) |
+| drums | 10.11 | #2 (mdx_extra_q leads at 11.49) |
+| bass | 10.38 | #2 (mdx_extra_q leads at 11.42) |
+| other | 6.34 | #2 (mdx_extra_q leads at 7.67) |
+Full benchmark across every popular open-source separator:
+[StemSplitio/stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026).
+**ONNX vs PyTorch parity:** verified to < 1e-3 max abs diff on every stem
+during export. See the
+[Day 1 spike report](https://huggingface.co/StemSplitio/htdemucs-ft-drums-onnx#how-it-was-built)
+for the full engineering writeup.
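A parity check of this kind can be sketched as follows. This is an illustrative sketch, not the export script used for this repo: `max_abs_diff` and `passes_parity` are hypothetical helper names, and the random arrays stand in for the PyTorch and ONNX outputs of one model on the same input segment.

```python
import numpy as np


def max_abs_diff(reference: np.ndarray, candidate: np.ndarray) -> float:
    """Largest element-wise absolute difference between two same-shaped arrays."""
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    # Compare in float64 so the subtraction itself adds no extra rounding error.
    diff = reference.astype(np.float64) - candidate.astype(np.float64)
    return float(np.max(np.abs(diff)))


def passes_parity(reference: np.ndarray, candidate: np.ndarray, tol: float = 1e-3) -> bool:
    """True when the two outputs agree to within `tol` (the README's 1e-3 threshold)."""
    return max_abs_diff(reference, candidate) < tol


# Stand-in data shaped like one model output: (stems, channels, samples).
rng = np.random.default_rng(0)
torch_out = rng.uniform(-1, 1, size=(4, 2, 1024)).astype(np.float32)
onnx_out = torch_out + np.float32(1e-5)  # simulated small numeric drift

print("parity ok:", passes_parity(torch_out, onnx_out))
```

Max absolute difference is a strict metric: a single outlier sample fails the check, which is the point when validating an export.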
+---
+## Performance
+Measured on an Apple M4 Pro, except for the NVIDIA rows, which are extrapolated rather than measured:
+| Mode | Hardware | Per 3-min song | Notes |
+|---|---|---:|---|
+| One specialist (`htdemucs-ft-drums-onnx`) | M4 Pro CPU | **~22 s** | 4× faster, 75% smaller — use this if you only need one stem |
+| **Full bag (this repo)** | M4 Pro CPU | **~88 s** | RTF ~0.5. 4 sub-models × N chunks. |
+| Full bag | M4 Pro CPU (8 threads) | ~60 s | With `OMP_NUM_THREADS=8` and SessionOptions tuned |
+| Full bag | NVIDIA L4 CUDA | ~6 s | Extrapolated from per-specialist CUDA numbers |
+| Full bag | NVIDIA T4 | ~16 s | Extrapolated |
+| PyTorch full bag | M4 Pro MPS | ~47 s | Faster only because MPS is GPU-accelerated; ONNX-CUDA beats it cleanly. |
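The RTF (real-time factor) quoted in the table is processing time divided by audio duration. A minimal sketch of that arithmetic, using the table's CPU timings and assuming the "3-min song" is exactly 180 s (`real_time_factor` is a hypothetical helper name):

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


SONG_S = 180.0  # assumed length of the 3-minute test song

full_bag_rtf = real_time_factor(88.0, SONG_S)    # full bag on M4 Pro CPU
specialist_rtf = real_time_factor(22.0, SONG_S)  # one specialist on M4 Pro CPU

print(f"full bag RTF {full_bag_rtf:.2f}, one specialist RTF {specialist_rtf:.2f}")
```

The 4× gap between the two rows follows directly from the bag running four sub-models per chunk.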
+---
+## Common use cases
+- **Karaoke makers** — sum `out/drums.wav`, `out/bass.wav`, and `out/other.wav` for an
+  instrumental karaoke track; `out/vocals.wav` is the acapella, all from one pass.
+- **DAW stem export** — drop the 4 `.wav` files into Ableton / Logic /
+  Reaper as separate channels for remixing.
+- **DJ stems software** — load all 4 stems as live-mixable tracks.
+- **AI music apps** — feed each stem into downstream models (drum
+  transcription, bassline-to-MIDI, vocal pitch correction).
+- **Acapella sampling** — clean isolated vocals at the highest SDR
+  available in open source.
+- **Mobile / on-device separation** — replaces a 1+ GB PyTorch install
+  with `onnxruntime`'s 50 MB binary on iOS / Android.
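For the karaoke case, an instrumental is the sum of every stem except vocals. The sketch below assumes the stems are `(2, samples)` float arrays like those returned by `bag_infer.separate_all`; `instrumental_mix` is a hypothetical helper that is not part of `bag_infer.py`, and the constant arrays are stand-ins for real audio.

```python
import numpy as np


def instrumental_mix(stems: dict) -> np.ndarray:
    """Sum every stem except vocals into one (channels, samples) array."""
    parts = [audio for name, audio in stems.items() if name != "vocals"]
    if not parts:
        raise ValueError("no non-vocal stems to mix")
    mix = np.sum(parts, axis=0)
    # Summed stems can exceed [-1, 1]; scale down only when that happens.
    peak = float(np.max(np.abs(mix)))
    return mix / peak if peak > 1.0 else mix


# Tiny stand-in stems; real ones would come from bag_infer.separate_all(...).
levels = {"drums": 0.2, "bass": 0.1, "other": 0.3, "vocals": 0.5}
stems = {name: np.full((2, 8), level, dtype=np.float32) for name, level in levels.items()}

karaoke = instrumental_mix(stems)
print(karaoke.shape, float(karaoke[0, 0]))
```

With real stems, the result can be written with `soundfile.write(path, karaoke.T, 44100)`, matching how `bag_infer.py` writes its own outputs.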
+---
+## Quick start
+### Python — as a library
+```python
+import bag_infer
+stems = bag_infer.separate_all("your-song.mp3")
+# stems: dict[str, numpy.ndarray (2, samples)]
+#   stems["drums"], stems["bass"], stems["other"], stems["vocals"]
+```
+### Python — with execution provider control
+```python
+import soundfile as sf
+import bag_infer
+audio, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)
+# separate() requires a (2, samples) array at 44.1 kHz and raises ValueError otherwise.
+stems = bag_infer.separate(
+    audio.T, sr,
+    providers=["CPUExecutionProvider"],  # or "CoreMLExecutionProvider", etc.
+)
+for name, stem_audio in stems.items():
+    sf.write(f"{name}.wav", stem_audio.T, sr)
+```
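When the same code has to run on several machines, one option is to build the `providers` list from whatever the local onnxruntime build reports. The sketch below is an assumption-laden illustration: `pick_providers` and `PREFERRED` are hypothetical names, and the availability list is hard-coded so the snippet runs without onnxruntime installed.

```python
def pick_providers(available: list[str], preferred: list[str]) -> list[str]:
    """Keep the preferred providers that are available, in order, then fall back to CPU."""
    chosen = [p for p in preferred if p in available and p != "CPUExecutionProvider"]
    chosen.append("CPUExecutionProvider")  # always keep a CPU fallback last
    return chosen


# Order of preference is an assumption; adjust it for your deployment targets.
PREFERRED = ["CUDAExecutionProvider", "CoreMLExecutionProvider", "DmlExecutionProvider"]

# With onnxruntime installed, this list would come from
# onnxruntime.get_available_providers(); it is hard-coded here for illustration.
example_available = ["CoreMLExecutionProvider", "CPUExecutionProvider"]

providers = pick_providers(example_available, PREFERRED)
print(providers)
```

The resulting list can be passed as the `providers=` argument of `bag_infer.separate`.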
+### CLI
+```bash
+python bag_infer.py your-song.mp3 ./out/
+python bag_infer.py your-song.mp3 ./out/ --providers cuda
+python bag_infer.py your-song.mp3 ./out/ --providers coreml
+python bag_infer.py your-song.mp3 ./out/ --providers dml
+```
+### Web / mobile
+Each specialist is a vanilla onnxruntime model; just load all 4 sessions
+and reuse the aggregation logic in `bag_infer.py::separate`. See the
+individual stem repos for platform-specific snippets:
+[drums](https://huggingface.co/StemSplitio/htdemucs-ft-drums-onnx) ·
+[bass](https://huggingface.co/StemSplitio/htdemucs-ft-bass-onnx) ·
+[other](https://huggingface.co/StemSplitio/htdemucs-ft-other-onnx) ·
+[vocals](https://huggingface.co/StemSplitio/htdemucs-ft-vocals-onnx).
+---
+## How aggregation works
+The `htdemucs_ft` bag uses a **one-hot weight matrix** for combining the 4
+sub-models — model 0's drums output is used directly as the bag's drums
+stem, model 1's bass output is the bag's bass stem, and so on. No
+weighted-sum aggregation needed.
+That means:
+- **The bag's drums stem == the drums specialist's drums output** (bit-exact in fp32)
+- Same for bass, other, vocals
+- So you can ship only the specialists you need and get identical
+  per-stem quality to the full bag at 1/4 the size
+`bag_infer.py` simply runs all 4 specialists and picks the relevant row
+from each. ~30 lines of numpy.
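To see why a one-hot weight matrix makes aggregation trivial, the sketch below compares a general weighted bag average with simply picking row N from model N. It uses random stand-in arrays rather than real model outputs, and the `weights[m, s]` layout (model m, stem s) is an assumption made for this illustration.

```python
import numpy as np

SOURCES = ["drums", "bass", "other", "vocals"]
rng = np.random.default_rng(0)

# model_outputs[m, s] is model m's estimate of stem s, each of shape (2, samples).
model_outputs = rng.standard_normal((4, 4, 2, 16)).astype(np.float32)

# General bag: per-stem weighted average over the models.
weights = np.eye(4, dtype=np.float32)  # one-hot: model m only contributes to stem m
weighted = np.einsum("ms,mscn->scn", weights, model_outputs)
weighted /= weights.sum(axis=0)[:, None, None]  # normalize by total weight per stem

# Shortcut used when the matrix is one-hot: take the diagonal entries directly.
picked = np.stack([model_outputs[i, i] for i in range(len(SOURCES))])

print("shortcut matches weighted average:", np.allclose(weighted, picked))
```

With any non-diagonal weights the two results would differ, and the full weighted sum would be required.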
+---
+## Input / output spec per sub-model
+| Tensor | Name | Shape | Dtype | Notes |
+|---|---|---|---|---|
+| Input | `mix` | `(1, 2, 343980)` | float32 | Stereo audio, 44.1 kHz, 7.8 s segment. |
+| Output | `stems` | `(1, 4, 2, 343980)` | float32 | `[drums, bass, other, vocals]`. Use only the specialist's target row. |
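+For audio that is longer or shorter than one segment, `bag_infer.py` splits the input into 343,980-sample chunks with 25% overlap, zero-pads the final chunk, and recombines the outputs by weighted overlap-add.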
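The chunk layout can be worked out by hand. The sketch below reproduces the chunk arithmetic from `bag_infer.separate` (25% overlap, ceiling division on the stride) for a 3-minute song; `plan_chunks` is a hypothetical helper written for this illustration and is not a function in the script.

```python
SAMPLE_RATE = 44100
N_SAMPLES = 343_980           # one 7.8 s segment
OVERLAP = N_SAMPLES // 4      # 85,995 samples shared between neighbouring chunks
STRIDE = N_SAMPLES - OVERLAP  # 257,985 samples between chunk starts


def plan_chunks(total_len: int) -> list[tuple[int, int]]:
    """Return (start, end) sample ranges covering an input of `total_len` samples."""
    n_chunks = max(1, -(-total_len // STRIDE))  # ceiling division
    return [(i * STRIDE, min(i * STRIDE + N_SAMPLES, total_len)) for i in range(n_chunks)]


chunks = plan_chunks(180 * SAMPLE_RATE)  # 3-minute song = 7,938,000 samples
onnx_runs = 4 * len(chunks)              # each chunk goes through all 4 specialists

print(len(chunks), "chunks,", onnx_runs, "ONNX runs; last chunk", chunks[-1])
```

The final chunk is shorter than one segment, which is why the script zero-pads it before inference and then keeps only the valid samples.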
+---
+## Files in this repo
+| File | Size | Purpose |
+|---|---:|---|
+| `htdemucs_ft_drums.onnx`  | 316 MB | Drums specialist (bag index 0) |
+| `htdemucs_ft_bass.onnx`   | 316 MB | Bass specialist (bag index 1) |
+| `htdemucs_ft_other.onnx`  | 316 MB | Other specialist (bag index 2) |
+| `htdemucs_ft_vocals.onnx` | 316 MB | Vocals specialist (bag index 3) |
+| `bag_infer.py` | 7 KB | Pure numpy aggregator. No torch. |
+| `requirements.txt` | <1 KB | `onnxruntime`, `numpy`, `soundfile`. |
+| `README.md` | this file | |
+Total: **~1.26 GB**. If that's too big, use individual stem repos.
+---
+## Related work
+| Repo | Stem | Use when |
+|---|---|---|
+| [`htdemucs-ft-drums-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-drums-onnx) | drums | Only need drums (1/4 size, 1/4 latency) |
+| [`htdemucs-ft-bass-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-bass-onnx) | bass | Only need bass |
+| [`htdemucs-ft-other-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-other-onnx) | other | Only need "other" (everything that is not drums, bass, or vocals) |
+| [`htdemucs-ft-vocals-onnx`](https://huggingface.co/StemSplitio/htdemucs-ft-vocals-onnx) | vocals | **#1 open-source vocal SDR** |
+PyTorch versions for HF Inference Endpoints:
+[`htdemucs-ft-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-pytorch)
+and its [4 sibling specialist repos](https://huggingface.co/StemSplitio).
+---
+## Skip the infrastructure — use the StemSplit API
+Don't want to ship 1.26 GB of `.onnx` files in your app, manage a GPU
+pool, or write overlap-add chunking? Use the
+**[StemSplit API](https://stemsplit.io/developers)** instead — same models
+under the hood, hosted for you, with credits and a dashboard.
+- 🌐 [stemsplit.io](https://stemsplit.io)
+- 📘 [Developer docs](https://stemsplit.io/developers/docs)
+- 🔌 [API reference](https://stemsplit.io/developers/reference)
+Or use the no-code tools that ship this same model family:
+- 🎤 [Vocal Remover](https://stemsplit.io/vocal-remover)
+- 🎶 [Karaoke Maker](https://stemsplit.io/karaoke-maker)
+- 🎙️ [Acapella Maker](https://stemsplit.io/acapella-maker)
+- 📺 [YouTube Stem Splitter](https://stemsplit.io/youtube-stem-splitter)
+---
+## License & attribution
+MIT-licensed, matching the original HT-Demucs.
+```bibtex
+@inproceedings{rouard2023hybrid,
+  title     = {Hybrid Transformers for Music Source Separation},
+  author    = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
+  booktitle = {ICASSP},
+  year      = {2023}
+}
+```
+- Original PyTorch model: [`facebookresearch/demucs`](https://github.com/facebookresearch/demucs)
+- ONNX export, parity verification, and packaging by [StemSplit](https://stemsplit.io)
+- Search keywords: htdemucs onnx, demucs onnx, htdemucs bag onnx, demucs ios, demucs android, music source separation onnx, 4-stem separation onnx, stem separation mobile, onnxruntime music separation

bag_infer.py ADDED

	@@ -0,0 +1,194 @@

+"""
+Bag inference for the full HT-Demucs FT 4-stem ONNX ensemble.
+Runs all 4 specialist sub-models and aggregates their outputs using the
+htdemucs_ft bag's one-hot weight matrix (drums-model -> drums stem only,
+bass-model -> bass stem only, etc).
+NO TORCH at inference. Just numpy + onnxruntime + soundfile.
+Usage:
+    python bag_infer.py your-song.mp3 ./out/
+    # writes out/drums.wav, out/bass.wav, out/other.wav, out/vocals.wav
+Or as a library:
+    import bag_infer
+    stems = bag_infer.separate_all("song.mp3")
+    # stems: dict[str, numpy.ndarray (2, samples)]
+"""
+from __future__ import annotations
+import argparse
+import sys
+import time
+from pathlib import Path
+import numpy as np
+import onnxruntime as ort
+import soundfile as sf
+SAMPLE_RATE = 44100
+SEGMENT_S = 7.8
+N_SAMPLES = round(SEGMENT_S * SAMPLE_RATE)  # 343,980; round() guards against float truncation
+N_CHANNELS = 2
+SOURCES = ["drums", "bass", "other", "vocals"]
+HERE = Path(__file__).resolve().parent
+# The bag's weight matrix for htdemucs_ft is one-hot per stem:
+#   drums  specialist (bag.models[0]) -> contributes only to drums stem
+#   bass   specialist (bag.models[1]) -> contributes only to bass stem
+#   other  specialist (bag.models[2]) -> contributes only to other stem
+#   vocals specialist (bag.models[3]) -> contributes only to vocals stem
+# So aggregation is trivial: pick row N from model N's output.
+DEFAULT_ONNX_FILES = {
+    "drums":  HERE / "htdemucs_ft_drums.onnx",
+    "bass":   HERE / "htdemucs_ft_bass.onnx",
+    "other":  HERE / "htdemucs_ft_other.onnx",
+    "vocals": HERE / "htdemucs_ft_vocals.onnx",
+}
+def _make_transition_window(segment: int, overlap_frac: float = 0.25) -> np.ndarray:
+    transition = int(segment * overlap_frac)
+    window = np.ones(segment, dtype=np.float32)
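+    # Drop the exact 0 and 1 endpoints so the very first and last samples keep a
+    # nonzero weight; complementary fades in an overlap still sum to 1.
+    fade = np.linspace(0, 1, transition + 2, dtype=np.float32)[1:-1]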
+    window[:transition] = fade
+    window[-transition:] = fade[::-1]
+    return window
+def _load_sessions(onnx_files: dict[str, Path],
+                   providers: list[str] | None = None,
+                   ) -> dict[str, ort.InferenceSession]:
+    if providers is None:
+        providers = ["CPUExecutionProvider"]
+    sessions: dict[str, ort.InferenceSession] = {}
+    for stem, path in onnx_files.items():
+        if not path.exists():
+            raise FileNotFoundError(
+                f"Missing {stem} model at {path}. Download all 4 .onnx files "
+                "into the same directory as this script.")
+        sessions[stem] = ort.InferenceSession(str(path), providers=providers)
+    return sessions
+def separate(mix: np.ndarray, sample_rate: int,
+             onnx_files: dict[str, Path] | None = None,
+             providers: list[str] | None = None,
+             verbose: bool = True) -> dict[str, np.ndarray]:
+    """Run full 4-stem chunked overlap-add separation.
+    Args:
+        mix: (channels, samples) float32 in [-1, 1], 44.1 kHz stereo.
+        sample_rate: must equal 44100.
+        onnx_files: optional dict overriding the default file locations.
+        providers: onnxruntime EPs; defaults to CPU.
+        verbose: print progress per chunk.
+    Returns:
+        dict of {stem_name: (channels, samples) float32}.
+    """
+    if sample_rate != SAMPLE_RATE:
+        raise ValueError(f"Expected {SAMPLE_RATE} Hz audio, got {sample_rate} Hz; "
+                         "resample the input first.")
+    if mix.ndim != 2 or mix.shape[0] != N_CHANNELS:
+        raise ValueError(f"Expected (2, samples) input, got {mix.shape}")
+    sessions = _load_sessions(onnx_files or DEFAULT_ONNX_FILES, providers)
+    if verbose:
+        print(f"  loaded {len(sessions)} ONNX sessions on "
+              f"{list(sessions.values())[0].get_providers()[0]}")
+    total_len = mix.shape[1]
+    overlap = N_SAMPLES // 4
+    stride = N_SAMPLES - overlap
+    n_chunks = max(1, (total_len + stride - 1) // stride)
+    if verbose:
+        print(f"  input:  {total_len:,} samples ({total_len / sample_rate:.1f}s)")
+        print(f"  chunks: {n_chunks}")
+    window = _make_transition_window(N_SAMPLES)
+    out = {stem: np.zeros((N_CHANNELS, total_len), dtype=np.float32) for stem in SOURCES}
+    weight = np.zeros(total_len, dtype=np.float32)
+    t0 = time.perf_counter()
+    for i in range(n_chunks):
+        start = i * stride
+        end = min(start + N_SAMPLES, total_len)
+        chunk = mix[:, start:end]
+        if chunk.shape[1] < N_SAMPLES:
+            chunk = np.pad(chunk, ((0, 0), (0, N_SAMPLES - chunk.shape[1])),
+                           mode="constant")
+        x = chunk[np.newaxis, ...].astype(np.float32)
+        chunk_len = end - start
+        w = window[:chunk_len]
+        # Run each specialist; take only its target stem row.
+        for stem in SOURCES:
+            stems = sessions[stem].run(["stems"], {"mix": x})[0][0]  # (4, 2, N)
+            target_row = SOURCES.index(stem)  # 0/1/2/3 matches bag.models[idx]
+            out[stem][:, start:end] += stems[target_row, :, :chunk_len] * w
+        weight[start:end] += w
+        if verbose:
+            print(f"    chunk {i+1}/{n_chunks}: "
+                  f"{time.perf_counter() - t0:.1f}s elapsed")
+    weight = np.maximum(weight, 1e-8)
+    for stem in SOURCES:
+        out[stem] /= weight
+    if verbose:
+        rtf = (time.perf_counter() - t0) / (total_len / sample_rate)
+        print(f"  total:  {time.perf_counter() - t0:.2f}s (RTF {rtf:.2f}, "
+              f"4 sub-models × {n_chunks} chunks = "
+              f"{4 * n_chunks} ONNX runs)")
+    return out
+def separate_all(input_path: str, **kwargs) -> dict[str, np.ndarray]:
+    """Convenience: load audio, run separation, return all 4 stems."""
+    audio, sr = sf.read(input_path, dtype="float32", always_2d=True)
+    audio = audio.T
+    if audio.shape[0] == 1:
+        audio = np.tile(audio, (2, 1))
+    elif audio.shape[0] > 2:
+        audio = audio[:2]
+    return separate(audio, sr, **kwargs)
+def main() -> None:
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("input", type=Path)
+    ap.add_argument("out_dir", type=Path)
+    ap.add_argument("--providers", type=str, default="cpu",
+                    choices=["cpu", "coreml", "cuda", "dml"])
+    args = ap.parse_args()
+    providers_map = {
+        "cpu":    ["CPUExecutionProvider"],
+        "coreml": ["CoreMLExecutionProvider", "CPUExecutionProvider"],
+        "cuda":   ["CUDAExecutionProvider", "CPUExecutionProvider"],
+        "dml":    ["DmlExecutionProvider", "CPUExecutionProvider"],
+    }
+    args.out_dir.mkdir(parents=True, exist_ok=True)
+    print(f"Loading {args.input} ...")
+    audio, sr = sf.read(str(args.input), dtype="float32", always_2d=True)
+    audio = audio.T
+    if audio.shape[0] == 1:
+        audio = np.tile(audio, (2, 1))
+    elif audio.shape[0] > 2:
+        audio = audio[:2]
+    print(f"  shape {audio.shape}, sr {sr}")
+    stems = separate(audio, sr, providers=providers_map[args.providers])
+    for stem, audio_out in stems.items():
+        out_path = args.out_dir / f"{stem}.wav"
+        sf.write(str(out_path), audio_out.T, sr)
+        print(f"  wrote {out_path}")
+if __name__ == "__main__":
+    main()

requirements.txt ADDED

	@@ -0,0 +1,3 @@

+onnxruntime>=1.20
+numpy>=1.24
+soundfile>=0.12