Commit 71605c4 (verified), committed by StemSplit · 1 Parent(s): 522bca6

feat: initial release (HT-Demucs FT, PyTorch, ready for HF Inference Endpoints)

Files changed (3)
  1. README.md +229 -0
  2. handler.py +110 -0
  3. requirements.txt +5 -0
README.md ADDED
@@ -0,0 +1,229 @@
+ ---
+ language: en
+ license: mit
+ library_name: demucs
+ pipeline_tag: audio-to-audio
+ tags:
+ - stem-separation
+ - source-separation
+ - vocal-isolation
+ - vocal-remover
+ - music
+ - demucs
+ - htdemucs
+ - karaoke
+ - audio-to-audio
+ datasets:
+ - StemSplitio/stem-separation-benchmark-2026
+ inference: false
+ ---
+
+ # HT-Demucs FT — Production-ready PyTorch model card
+
+ The **highest-vocal-SDR open-source stem separator on MUSDB18-HQ** (9.19 dB
+ median), packaged for Hugging Face Inference Endpoints with a ready-to-deploy
+ `handler.py`. Use it for vocal removal, karaoke generation, acapella
+ extraction, and any task that needs clean 4-stem separation of music
+ (`vocals`, `drums`, `bass`, `other`).
+
+ This is the `htdemucs_ft` 4-bag ensemble by [Défossez et al. (Meta AI)][demucs-repo],
+ repackaged with attribution. Original training and weights are unchanged;
+ we add the deployment handler, the model card, and the benchmark context.
+
+ > Need it as a REST API today, without standing up GPUs? Use the
+ > [**StemSplit API**](https://stemsplit.io/developers) — same model, hosted
+ > for you, with credits and a dashboard.
+
+ ---
+
+ ## Quality (independently benchmarked)
+
+ Median SDR (dB) per stem on the standard MUSDB18-HQ test split (50 songs), BSS
+ Eval v4 via `museval`. Higher is better. Source:
+ [StemSplitio/stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026)
+ v1.1.
+
+ | Model | vocals | drums | bass | other |
+ |---|---:|---:|---:|---:|
+ | `htdemucs_ft` *(this card)* | **9.19** | 10.11 | 10.38 | 6.34 |
+ | `mdx_extra_q` | 9.04 | **11.49** | **11.42** | **7.67** |
+ | `htdemucs_6s` | 8.66 | 9.54 | 9.11 | 5.74 |
+ | `htdemucs` | 8.53 | 10.01 | 9.78 | 6.42 |
+ | `mdx_net_inst_hq3` *(vocals-only)* | 5.81 | — | — | — |
+
+ **Pick this model when vocals are the priority** — it beats every other
+ open-source separator on MUSDB18-HQ vocals. For drums/bass-focused work,
+ consider `mdx_extra_q` instead.
+
+ ---
+
+ ## Quick start (Python)
+
+ ```python
+ import base64, io, json, soundfile as sf
+ from huggingface_hub import InferenceClient
+
+ with open("your-song.mp3", "rb") as f:
+     audio_b64 = base64.b64encode(f.read()).decode()
+
+ client = InferenceClient(model="StemSplitio/htdemucs-ft-pytorch")
+ # client.post returns the raw response bytes, so parse the JSON before indexing
+ result = json.loads(client.post(json={"inputs": audio_b64}))
+
+ for stem in ("vocals", "drums", "bass", "other"):
+     wav, sr = sf.read(io.BytesIO(base64.b64decode(result[stem])))
+     sf.write(f"out_{stem}.wav", wav, sr)
+ ```
+
+ Or run locally without Hugging Face at all:
+
+ ```python
+ import torch, soundfile as sf
+ from demucs.apply import apply_model
+ from demucs.audio import convert_audio
+ from demucs.pretrained import get_model
+
+ model = get_model("htdemucs_ft").eval()
+ wav, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)
+ wav = torch.from_numpy(wav.T).contiguous()
+ wav = convert_audio(wav, sr, model.samplerate, model.audio_channels).unsqueeze(0)
+
+ with torch.no_grad():
+     stems = apply_model(model, wav, device="mps" if torch.backends.mps.is_available() else "cpu")[0]
+
+ for i, name in enumerate(model.sources):  # ["drums", "bass", "other", "vocals"]
+     sf.write(f"out_{name}.wav", stems[i].T.numpy(), model.samplerate)
+ ```
+
+ ---
+
+ ## Deploy on Hugging Face Inference Endpoints
+
+ Click **Deploy → Inference Endpoints** above, pick a GPU instance, and HF
+ will spin up a container running [`handler.py`](handler.py). Recommended
+ hardware tiers based on M4 Pro reference latency:
+
+ | Hardware | RTF | Latency for 3-min song |
+ |---|---:|---:|
+ | NVIDIA L4 | ~0.04 | ~7 s |
+ | NVIDIA T4 small | ~0.10 | ~18 s |
+ | CPU x4 (basic) | ~0.7 | ~125 s |
+
+ Then call the endpoint:
+
+ ```bash
+ # Stream the body over stdin: this strips base64 line wraps and avoids argv size limits
+ { printf '{"inputs": "'; base64 < your-song.mp3 | tr -d '\n'; printf '"}'; } |
+   curl -X POST https://<your-endpoint>.endpoints.huggingface.cloud \
+     -H "Authorization: Bearer $HF_TOKEN" \
+     -H "Content-Type: application/json" \
+     --data-binary @-
+ ```
+
+ The response is a JSON object with `vocals`, `drums`, `bass`, and `other` as
+ base64-encoded 16-bit PCM WAVs at 44.1 kHz, plus `sample_rate` and `duration_s`.
+
+ ---
+
+ ## Skip the infrastructure — use the StemSplit API
+
+ If you'd rather not run your own endpoint, the
+ [**StemSplit API**](https://stemsplit.io/developers) wraps this same model
+ (and the rest of the [benchmarked lineup](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026))
+ behind a hosted REST API with credits, a dashboard, and webhooks.
+
+ ```bash
+ curl -X POST https://stemsplit.io/api/v1/jobs \
+   -H "Authorization: Bearer $STEMSPLIT_API_KEY" \
+   -F "audio=@your-song.mp3" \
+   -F "model=htdemucs_ft"
+ ```
+
+ - 📘 [Developer docs](https://stemsplit.io/developers/docs)
+ - 🔌 [API reference](https://stemsplit.io/developers/reference)
+ - 📚 [Guides & recipes](https://stemsplit.io/developers/guides)
+
+ Or try it in your browser, no code:
+
+ - 🎤 [**Vocal Remover** (free online tool)](https://stemsplit.io/vocal-remover)
+   — upload a song, get an instrumental + isolated vocals
+ - 🎶 [Karaoke Maker](https://stemsplit.io/karaoke-maker) —
+   same model, optimised for karaoke output
+ - 🎙️ [Acapella Maker](https://stemsplit.io/acapella-maker) — extract clean
+   vocal acapellas for remixes and sampling
+ - 📺 [YouTube Stem Splitter](https://stemsplit.io/youtube-stem-splitter) —
+   paste a YouTube URL, get the stems
+
+ ---
+
+ ## Performance
+
+ Measured on an Apple M4 Pro (24 GB unified memory) with PyTorch 2.4 MPS, for
+ the full 4-bag ensemble on 50 MUSDB18-HQ tracks (median track length ~4 min,
+ RTF 0.26 ± 0.02). Cloud GPU numbers are extrapolated from public Demucs
+ benchmarks.
+
+ | Hardware | Per 3-min song | Peak RAM | Notes |
+ |---|---:|---:|---|
+ | Apple M4 Pro (MPS) | ~47 s | 3.1 GB | Measured in [our benchmark](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026) (RTF 0.26) |
+ | NVIDIA L4 (CUDA) | ~7 s | 4 GB | Extrapolated |
+ | NVIDIA T4 small (CUDA) | ~18 s | 4 GB | Extrapolated |
+ | CPU (8-core) | ~125 s | 3 GB | Slow, but works for batch jobs |
+
+ ---
+
+ ## How `htdemucs_ft` differs from the other Demucs models
+
+ | Variant | Bag size | Best at | When to choose |
+ |---|---:|---|---|
+ | `htdemucs_ft` *(this)* | 4 | **Vocals** | Karaoke, vocal isolation, acapella extraction |
+ | `htdemucs` | 1 | Balanced | Lower latency / smaller deploy |
+ | `htdemucs_6s` | 1 | 6-stem (adds piano, guitar) | When you need piano/guitar separately |
+ | `mdx_extra_q` | 4 | **Drums, bass** | Music production where rhythm section is the priority |
+
+ See the full
+ [stem-separation benchmark dataset](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026)
+ for SDR / ISR / SIR / SAR across all stems.
+
+ ---
+
+ ## Files in this repo
+
+ - [`handler.py`](handler.py) — the `EndpointHandler` class that HF Inference
+   Endpoints calls on each request. Accepts base64 audio in, returns base64 stems out.
+ - [`requirements.txt`](requirements.txt) — Python deps (torch, torchaudio, demucs, numpy, soundfile).
+ - `README.md` — this card.
+
+ Model weights are downloaded into the container's torch hub cache on first
+ run (no `.pt` / `.th` files are stored in this repo to keep it small).
+
+ ---
+
+ ## License & attribution
+
+ This repo is **MIT-licensed**, matching the original HT-Demucs.
+
+ **Please cite the original authors** if you use this model in research:
+
+ ```bibtex
+ @inproceedings{rouard2023hybrid,
+   title     = {Hybrid Transformers for Music Source Separation},
+   author    = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
+   booktitle = {ICASSP},
+   year      = {2023}
+ }
+ ```
+
+ And if you use the benchmark or this packaging:
+
+ ```bibtex
+ @misc{stemsplit_benchmark_2026,
+   title  = {StemSplit Stem-Separation Benchmark 2026},
+   author = {StemSplit},
+   year   = {2026},
+   url    = {https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026}
+ }
+ ```
+
+ - Original model: [`facebookresearch/demucs`][demucs-repo]
+ - Packaging by [StemSplit](https://stemsplit.io)
+ - Benchmark dataset: [StemSplitio/stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026)
+
+ [demucs-repo]: https://github.com/facebookresearch/demucs
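The latency columns in the README tables follow from the real-time factor (RTF): processing time is roughly RTF multiplied by the audio duration. The sketch below reproduces that arithmetic for a 3-minute song. The function name `estimate_latency_s` and the `README_RTF` dictionary are illustrative and not part of this commit; the RTF values are copied from the README, where the cloud GPU figures are extrapolated rather than measured.

```python
def estimate_latency_s(rtf: float, duration_s: float) -> float:
    """Approximate processing time as real-time factor times audio length."""
    if rtf <= 0 or duration_s < 0:
        raise ValueError("rtf must be positive and duration_s non-negative")
    return rtf * duration_s


# Approximate RTF values as listed in the README tables (hypothetical container).
README_RTF = {
    "Apple M4 Pro (MPS)": 0.26,
    "NVIDIA L4": 0.04,
    "NVIDIA T4 small": 0.10,
}

if __name__ == "__main__":
    for hardware, rtf in README_RTF.items():
        seconds = estimate_latency_s(rtf, 180)
        print(f"{hardware}: ~{seconds:.0f} s for a 3-min song")
```

The same function gives a rough estimate for other track lengths, provided RTF stays roughly constant with duration on the chosen hardware.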
handler.py ADDED
@@ -0,0 +1,110 @@
+ """
+ HF Inference Endpoint handler for HT-Demucs FT.
+
+ When deployed to an HF Inference Endpoint, HF instantiates EndpointHandler
+ once at container startup (downloading the demucs checkpoints into the
+ container cache), then calls __call__ on every HTTP request.
+
+ Request shape:
+     POST /
+     Content-Type: application/json
+     {
+       "inputs": "<base64-encoded audio bytes; any libsndfile-readable format>",
+       "parameters": {
+         "stems": ["vocals", "drums", "bass", "other"]   // optional, defaults to all 4
+       }
+     }
+
+ Response shape:
+     {
+       "vocals": "<base64 WAV>",
+       "drums":  "<base64 WAV>",
+       "bass":   "<base64 WAV>",
+       "other":  "<base64 WAV>",
+       "sample_rate": 44100,
+       "duration_s": 123.4
+     }
+
+ To deploy:
+   1) Create the endpoint in the HF UI (Deploy -> Inference Endpoints on the
+      model card), choose a GPU instance (T4 small minimum; L4 recommended)
+   2) Send requests as shown above.
+
+ Or skip self-hosting and use the StemSplit API:
+     https://stemsplit.io/developers
+ """
+ from __future__ import annotations
+
+ import base64
+ import io
+ from typing import Any
+
+ import numpy as np
+ import soundfile as sf
+ import torch
+ from demucs.apply import apply_model
+ from demucs.audio import convert_audio
+ from demucs.pretrained import get_model
+
+ DEFAULT_STEMS = ("vocals", "drums", "bass", "other")
+
+
+ def _audio_to_b64_wav(audio: torch.Tensor, sample_rate: int) -> str:
+     """Encode a (channels, samples) FP32 tensor as base64-PCM16 WAV."""
+     np_audio = audio.cpu().numpy().T  # -> (samples, channels)
+     np_audio = np.clip(np_audio, -1.0, 1.0)
+     buf = io.BytesIO()
+     sf.write(buf, np_audio, sample_rate, subtype="PCM_16", format="WAV")
+     return base64.b64encode(buf.getvalue()).decode("ascii")
+
+
+ class EndpointHandler:
+     """HF Inference Endpoint entrypoint."""
+
+     def __init__(self, path: str = "") -> None:
+         self.model = get_model("htdemucs_ft")
+         self.model.eval()
+         self.device = torch.device(
+             "cuda" if torch.cuda.is_available() else
+             "mps" if torch.backends.mps.is_available() else
+             "cpu"
+         )
+         self.model.to(self.device)
+         self.sample_rate = int(self.model.samplerate)
+         self.audio_channels = int(self.model.audio_channels)
+         self.sources = list(self.model.sources)
+
+     def __call__(self, data: dict[str, Any]) -> dict[str, Any]:
+         if "inputs" not in data:
+             return {"error": "Request body must include base64 audio under 'inputs'."}
+
+         try:
+             # Decode inside the try so malformed base64 is also reported as an error
+             audio_bytes = base64.b64decode(data["inputs"])
+             wav_np, sr = sf.read(io.BytesIO(audio_bytes), dtype="float32", always_2d=True)
+         except Exception as e:  # noqa: BLE001
+             return {"error": f"Could not decode audio: {type(e).__name__}: {e}"}
+
+         # wav_np: (samples, channels) -> (channels, samples) FP32
+         wav = torch.from_numpy(wav_np.T).contiguous()
+         wav = convert_audio(wav, sr, self.sample_rate, self.audio_channels)
+         wav = wav.unsqueeze(0).to(self.device)  # (1, channels, samples)
+
+         # Optional stem filter
+         params = data.get("parameters", {}) or {}
+         requested_stems = [s for s in params.get("stems", DEFAULT_STEMS) if s in self.sources]
+         if not requested_stems:
+             requested_stems = list(self.sources)
+
+         with torch.no_grad():
+             # apply_model handles overlap-add segmentation internally
+             stems = apply_model(self.model, wav, device=str(self.device), progress=False)[0]
+             # stems: (n_sources, channels, samples) on `self.device`
+
+         out: dict[str, Any] = {
+             "sample_rate": self.sample_rate,
+             "duration_s": round(wav.shape[-1] / self.sample_rate, 3),
+         }
+         for stem in requested_stems:
+             idx = self.sources.index(stem)
+             out[stem] = _audio_to_b64_wav(stems[idx], self.sample_rate)
+         return out
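For callers, the request and response shapes documented in the `handler.py` docstring can be wrapped in two small helpers. This is a minimal client-side sketch using only the standard library. The names `build_request`, `decode_response`, and `HANDLER_STEMS` are hypothetical and not part of this commit, and the HTTP transport is left out.

```python
import base64
import json

# Mirrors DEFAULT_STEMS in handler.py; redefined here so the sketch is self-contained.
HANDLER_STEMS = ("vocals", "drums", "bass", "other")


def build_request(audio_bytes: bytes, stems=None) -> str:
    """Serialize the JSON body the handler expects: base64 audio under 'inputs'."""
    body = {"inputs": base64.b64encode(audio_bytes).decode("ascii")}
    if stems is not None:
        # Optional filter; the handler reads it from parameters.stems.
        body["parameters"] = {"stems": list(stems)}
    return json.dumps(body)


def decode_response(response: dict) -> dict:
    """Return {stem_name: wav_bytes}; raise if the handler reported an error."""
    if "error" in response:
        raise RuntimeError(response["error"])
    return {
        stem: base64.b64decode(response[stem])
        for stem in HANDLER_STEMS
        if stem in response
    }
```

Two behaviours of the handler motivate this shape. Failures come back as an `{"error": ...}` body rather than an exception, so the client checks that key before decoding. Unknown stem names in `parameters.stems` are dropped, and if none are valid the handler returns all four stems, so the response may contain more stems than were requested.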
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ torch>=2.2,<2.6
+ torchaudio>=2.2,<2.6
+ demucs==4.0.1
+ numpy>=1.26,<2.0
+ soundfile>=0.12