	---
	language: en
	license: mit
	library_name: demucs
	pipeline_tag: audio-to-audio
	tags:
	- stem-separation
	- source-separation
	- vocal-isolation
	- vocal-remover
	- music
	- demucs
	- htdemucs
	- karaoke
	- audio-to-audio
	datasets:
	- StemSplitio/stem-separation-benchmark-2026
	inference: false
	---

	# HT-Demucs FT — Production-ready PyTorch model card

	The open-source stem separator with the highest vocal SDR on MUSDB18-HQ in
	our benchmark (9.19 dB median), packaged for Hugging Face Inference Endpoints
	with a ready-to-deploy `handler.py`. Use it for vocal removal, karaoke
	generation, acapella extraction, and any task that needs clean 4-stem
	separation of music (`vocals`, `drums`, `bass`, `other`).

	This is the `htdemucs_ft` 4-bag ensemble by [Défossez et al. (Meta AI)][demucs-repo],
	repackaged with attribution. Original training and weights are unchanged;
	we add the deployment handler, the model card, and the benchmark context.

	> Need it as a REST API today, without standing up GPUs? Use the
	> [StemSplit API](https://stemsplit.io/developers) — same model, hosted
	> for you, with credits and a dashboard.

	---

	## Quality (independently benchmarked)

	Median SDR per stem on the standard MUSDB18-HQ test split (50 songs), BSS
	Eval v4 via `museval`. Higher is better. Source:
	[StemSplitio/stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026)
	v1.1.

	| Model | vocals | drums | bass | other |
	|---|---:|---:|---:|---:|
	| `htdemucs_ft` (this card) | 9.19 | 10.11 | 10.38 | 6.34 |
	| `mdx_extra_q` | 9.04 | 11.49 | 11.42 | 7.67 |
	| `htdemucs_6s` | 8.66 | 9.54 | 9.11 | 5.74 |
	| `htdemucs` | 8.53 | 10.01 | 9.78 | 6.42 |
	| `mdx_net_inst_hq3` (vocals-only) | 5.81 | — | — | — |

	Pick this model when vocals are the priority: it has the highest median
	vocal SDR of the models benchmarked above. For drums- or bass-focused work,
	consider `mdx_extra_q` instead.
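
The recommendation above can be reproduced from the table itself. The following is a minimal sketch, assuming the scores are copied by hand into a dictionary; `SDR` and `best_model` are illustrative names, not part of any library.

```python
# Median SDR (dB) per stem, copied from the table above.
SDR = {
    "htdemucs_ft": {"vocals": 9.19, "drums": 10.11, "bass": 10.38, "other": 6.34},
    "mdx_extra_q": {"vocals": 9.04, "drums": 11.49, "bass": 11.42, "other": 7.67},
    "htdemucs_6s": {"vocals": 8.66, "drums": 9.54, "bass": 9.11, "other": 5.74},
    "htdemucs": {"vocals": 8.53, "drums": 10.01, "bass": 9.78, "other": 6.42},
    "mdx_net_inst_hq3": {"vocals": 5.81},  # vocals-only: no other stems reported
}


def best_model(stem: str) -> str:
    """Return the model with the highest median SDR for `stem`."""
    candidates = {name: scores[stem] for name, scores in SDR.items() if stem in scores}
    if not candidates:
        raise ValueError(f"no scores recorded for stem {stem!r}")
    return max(candidates, key=candidates.get)


for stem in ("vocals", "drums", "bass", "other"):
    print(f"{stem}: {best_model(stem)}")
```

Ranking by a single median hides per-track variance, so treat the result as a starting point and check the full benchmark before committing to a model.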

	---

	## Quick start (Python)

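	```python
	import base64, io, json
	import soundfile as sf
	from huggingface_hub import InferenceClient

	with open("your-song.mp3", "rb") as f:
	    audio_b64 = base64.b64encode(f.read()).decode()

	client = InferenceClient(model="StemSplitio/htdemucs-ft-pytorch")
	# `post` returns the raw response bytes, so parse them as JSON before indexing.
	# Newer `huggingface_hub` releases have deprecated `InferenceClient.post`; if it
	# is unavailable, send the same JSON body with any HTTP client instead.
	result = json.loads(client.post(json={"inputs": audio_b64}))

	for stem in ("vocals", "drums", "bass", "other"):
	    wav, sr = sf.read(io.BytesIO(base64.b64decode(result[stem])))
	    sf.write(f"out_{stem}.wav", wav, sr)
	```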

	Or run locally without Hugging Face at all:

	```python
	import torch, soundfile as sf
	from demucs.apply import apply_model
	from demucs.audio import convert_audio
	from demucs.pretrained import get_model

	model = get_model("htdemucs_ft").eval()
	# Reading MP3 through soundfile needs a libsndfile build with MP3 support (1.1.0+).
	wav, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)
	wav = torch.from_numpy(wav.T).contiguous()  # (samples, channels) -> (channels, samples)
	wav = convert_audio(wav, sr, model.samplerate, model.audio_channels).unsqueeze(0)

	# Use Apple's MPS backend when present, otherwise fall back to CPU.
	device = "mps" if torch.backends.mps.is_available() else "cpu"
	with torch.no_grad():
	    stems = apply_model(model, wav, device=device)[0]  # (sources, channels, samples)

	for i, name in enumerate(model.sources):  # ["drums", "bass", "other", "vocals"]
	    sf.write(f"out_{name}.wav", stems[i].T.numpy(), model.samplerate)
	```
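
For karaoke or vocal removal, the instrumental is the sum of every stem except `vocals`. The sketch below shows that step on toy data with NumPy, assuming stems arrive as an array shaped `(sources, channels, samples)` in the same order as the source names; `mix_instrumental` is an illustrative helper, not a Demucs function.

```python
import numpy as np


def mix_instrumental(stems: np.ndarray, sources: list[str]) -> np.ndarray:
    """Sum every stem except "vocals".

    stems: array of shape (n_sources, channels, samples), ordered like `sources`.
    """
    if "vocals" not in sources:
        raise ValueError("expected a 'vocals' entry in sources")
    keep = [i for i, name in enumerate(sources) if name != "vocals"]
    instrumental = stems[keep].sum(axis=0)
    # Summing stems can push peaks past full scale; clip to the valid float range.
    return np.clip(instrumental, -1.0, 1.0)


# Toy data standing in for real separator output: 4 stems, stereo, 8 samples.
sources = ["drums", "bass", "other", "vocals"]
stems = np.stack([np.full((2, 8), v, dtype=np.float32) for v in (0.1, 0.2, 0.3, 0.4)])
instrumental = mix_instrumental(stems, sources)
print(instrumental.shape)
```

With the local example above, the equivalent call would be `mix_instrumental(stems.numpy(), model.sources)`, followed by `sf.write("out_instrumental.wav", instrumental.T, model.samplerate)`.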

	---

	## Deploy on Hugging Face Inference Endpoints

	Click Deploy → Inference Endpoints above, pick a GPU instance, and HF
	will spin up a container running [`handler.py`](handler.py). Recommended
	hardware tiers, with real-time factor (RTF: processing time divided by audio
	duration) and latency estimated from the M4 Pro reference latency:

	| Hardware | RTF | Latency for 3-min song |
	|---|---:|---:|
	| NVIDIA L4 | ~0.04 | ~7 s |
	| NVIDIA T4 small | ~0.10 | ~18 s |
	| CPU x4 (basic) | ~0.7 | ~125 s |

	Then call the endpoint:

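	```bash
	# Strip the line breaks GNU base64 inserts (they would break the JSON string)
	# and send the body on stdin so a full-length song does not exceed the
	# shell's argument-length limit.
	( printf '{"inputs": "'; base64 < your-song.mp3 | tr -d '\n'; printf '"}' ) |
	  curl -X POST https://<your-endpoint>.endpoints.huggingface.cloud \
	    -H "Authorization: Bearer $HF_TOKEN" \
	    -H "Content-Type: application/json" \
	    --data-binary @-
	```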

	The response is a JSON object whose `vocals`, `drums`, `bass`, and `other`
	keys each hold a base64-encoded WAV at 44.1 kHz.
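
A client has to parse that response and decode each stem. The sketch below uses only the standard library and a simulated response so it runs without a live endpoint. It assumes the WAVs are 16-bit PCM, which the card does not state, and the helper names (`decode_stems`, `wav_info`, `_fake_wav`) are illustrative.

```python
import base64
import io
import json
import wave

STEMS = ("vocals", "drums", "bass", "other")


def decode_stems(body: str) -> dict[str, bytes]:
    """Parse the JSON response body and return raw WAV bytes per stem."""
    payload = json.loads(body)
    missing = [s for s in STEMS if s not in payload]
    if missing:
        raise KeyError(f"response is missing stems: {missing}")
    return {s: base64.b64decode(payload[s]) for s in STEMS}


def wav_info(data: bytes) -> tuple[int, int, int]:
    """Return (sample_rate, channels, frames) read from a PCM WAV header."""
    with wave.open(io.BytesIO(data), "rb") as w:
        return w.getframerate(), w.getnchannels(), w.getnframes()


def _fake_wav(frames: int = 4) -> bytes:
    """Build a tiny silent 16-bit stereo WAV, standing in for a real stem."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(2)
        w.setsampwidth(2)
        w.setframerate(44100)
        w.writeframes(b"\x00" * (frames * 2 * 2))
    return buf.getvalue()


# Simulated response so the sketch runs without a live endpoint.
fake_body = json.dumps({s: base64.b64encode(_fake_wav()).decode() for s in STEMS})
stems = decode_stems(fake_body)
print({name: wav_info(data) for name, data in stems.items()})
```

The standard-library `wave` module reads PCM only. If the endpoint returns floating-point WAVs, read the decoded bytes with `soundfile` as the Quick start does.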

	---

	## Skip the infrastructure — use the StemSplit API

	If you'd rather not run your own endpoint, the
	[StemSplit API](https://stemsplit.io/developers) wraps this same model
	(and the rest of the [benchmarked lineup](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026))
	behind a hosted REST API with credits, a dashboard, and webhooks.

	```bash
	curl -X POST https://stemsplit.io/api/v1/jobs \
	  -H "Authorization: Bearer $STEMSPLIT_API_KEY" \
	  -F "audio=@your-song.mp3" \
	  -F "model=htdemucs_ft"

	- 📘 [Developer docs](https://stemsplit.io/developers/docs)
	- 🔌 [API reference](https://stemsplit.io/developers/reference)
	- 📚 [Guides & recipes](https://stemsplit.io/developers/guides)

	Or try it in your browser, no code:

	- 🎤 [Vocal Remover (free online tool)](https://stemsplit.io/vocal-remover)
	— upload a song, get an instrumental + isolated vocals
	- 🎶 [Karaoke Maker](https://stemsplit.io/karaoke-maker) —
	same model, optimised for karaoke output
	- 🎙️ [Acapella Maker](https://stemsplit.io/acapella-maker) — extract clean
	vocal acapellas for remixes and sampling
	- 📺 [YouTube Stem Splitter](https://stemsplit.io/youtube-stem-splitter) —
	paste a YouTube URL, get the stems

	---

	## Performance

	Measured on an Apple M4 Pro (24 GB unified memory) with PyTorch 2.4 MPS, for
	the full 4-bag ensemble on 50 MUSDB18-HQ tracks (median track length ~4 min).
	The real-time factor (RTF, processing time divided by audio duration) was
	0.26 ± 0.02. Cloud GPU numbers are extrapolated from public Demucs
	benchmarks.

	| Hardware | Per 3-min song | Peak RAM | Notes |
	|---|---:|---:|---|
	| Apple M4 Pro (MPS) | ~47 s | 3.1 GB | Measured in [our benchmark](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026) (RTF 0.26) |
	| NVIDIA L4 (CUDA) | ~7 s | 4 GB | Extrapolated |
	| NVIDIA T4 small (CUDA) | ~18 s | 4 GB | Extrapolated |
	| CPU (8-core) | ~125 s | 3 GB | Slow, but works for batch jobs |
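
The per-song figures follow from the RTF: processing time is RTF multiplied by audio duration. Here is a small sketch of that arithmetic using the measured M4 Pro value; the function names are illustrative, and the throughput estimate assumes one sequential worker with no upload or queueing overhead.

```python
def latency_seconds(rtf: float, audio_seconds: float) -> float:
    """Estimate processing time from a real-time factor (processing / audio)."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return rtf * audio_seconds


def songs_per_hour(rtf: float, audio_seconds: float) -> float:
    """Sequential throughput on one worker, ignoring upload and queueing time."""
    return 3600.0 / latency_seconds(rtf, audio_seconds)


M4_PRO_RTF = 0.26  # measured value quoted above
three_minutes = 180.0

print(f"{latency_seconds(M4_PRO_RTF, three_minutes):.1f} s per 3-min song")
print(f"{songs_per_hour(M4_PRO_RTF, three_minutes):.0f} songs per hour")
```

For the M4 Pro this gives 0.26 × 180 s = 46.8 s, which the table rounds to ~47 s.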

	---

	## How `htdemucs_ft` differs from the other Demucs models

	| Variant | Bag size | Best at | When to choose |
	|---|---:|---|---|
	| `htdemucs_ft` (this) | 4 | Vocals | Karaoke, vocal isolation, acapella extraction |
	| `htdemucs` | 1 | Balanced | Lower latency / smaller deploy |
	| `htdemucs_6s` | 1 | 6-stem (adds piano, guitar) | When you need piano/guitar separately |
	| `mdx_extra_q` | 4 | Drums, bass | Music production where rhythm section is the priority |

	See the full
	[stem-separation benchmark dataset](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026)
	for SDR / ISR / SIR / SAR across all stems.

	---

	## Single-stem specialist variants (faster, smaller)

	If you only need one stem in production, ship a specialist sub-model
	instead of the full 4-bag ensemble. Same per-stem quality, ~160 MB instead
	of ~640 MB, ~2.6× faster on M4 Pro MPS:

	| Repo | Stem | Use cases |
	|---|---|---|
	| [`htdemucs-ft-drums-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-drums-pytorch) | drums | Drum extraction, beat transcription, sample-pack creation |
	| [`htdemucs-ft-bass-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-bass-pytorch) | bass | Bassline transcription, mix rebalancing, sub-bass mastering |
	| [`htdemucs-ft-other-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-other-pytorch) | other / instrumental | Karaoke instrumentals (pair with this vocals model), sample-flipping |

	This repo (the full bag) remains the best choice when you need vocals plus
	any other stem in a single request — it amortises the inference cost across
	all 4 stems.

	---

	## Files in this repo

	- [`handler.py`](handler.py) — `EndpointHandler` class HF Inference Endpoints
	calls on each request. Accepts base64 audio in, returns base64 stems out.
	- [`requirements.txt`](requirements.txt) — Python deps (torch, demucs, soundfile).
	- `README.md` — this card.

	Model weights are downloaded into the container's torch hub cache on first
	run (no `.pt` / `.th` files are stored in this repo to keep it small).
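
To make the base64-in, base64-out contract concrete, here is a hypothetical sketch of a handler with the same shape. It is not the `handler.py` shipped in this repo. The `separator` argument is an invented injection point so the example runs without torch, whereas a deployed handler would load `htdemucs_ft` in `__init__` and be constructed by the platform with only `path`.

```python
import base64
from typing import Any, Callable, Dict, Optional

# Hypothetical type: takes encoded audio bytes, returns WAV bytes per stem name.
Separator = Callable[[bytes], Dict[str, bytes]]


class EndpointHandler:
    """Illustrative handler; NOT the handler.py shipped in this repo."""

    def __init__(self, path: str = "", separator: Optional[Separator] = None):
        # A real handler would load the model here, once per container.
        # `separator` is an assumption made so this sketch stays runnable.
        if separator is None:
            raise ValueError("pass a separator callable")
        self.separator = separator

    def __call__(self, data: Dict[str, Any]) -> Dict[str, str]:
        audio_b64 = data.get("inputs")
        if not isinstance(audio_b64, str):
            raise ValueError('expected {"inputs": "<base64 audio>"}')
        audio = base64.b64decode(audio_b64)
        stems = self.separator(audio)
        # Encode each stem so the response is JSON-serialisable.
        return {name: base64.b64encode(wav).decode("ascii") for name, wav in stems.items()}


def echo_separator(audio: bytes) -> Dict[str, bytes]:
    """Stand-in for real separation: returns the input bytes for every stem."""
    return {name: audio for name in ("vocals", "drums", "bass", "other")}


handler = EndpointHandler(separator=echo_separator)
request = {"inputs": base64.b64encode(b"fake-audio").decode("ascii")}
response = handler(request)
print(sorted(response))
```

Loading the model in `__init__` rather than per request means the weight download and initialisation cost is paid once per container, not on every call.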

	---

	## License & attribution

	This repo is MIT-licensed, matching the original HT-Demucs.

	Please cite the original authors if you use this model in research:

	```bibtex
	@inproceedings{rouard2023hybrid,
	  title = {Hybrid Transformers for Music Source Separation},
	  author = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
	  booktitle = {ICASSP},
	  year = {2023}
	}
	```

	And if you use the benchmark or this packaging:

	```bibtex
	@misc{stemsplit_benchmark_2026,
	  title = {StemSplit Stem-Separation Benchmark 2026},
	  author = {StemSplit},
	  year = {2026},
	  url = {https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026}
	}
	```

	- Original model: [`facebookresearch/demucs`][demucs-repo]
	- Packaging by [StemSplit](https://stemsplit.io)
	- Benchmark dataset: [StemSplitio/stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026)

	[demucs-repo]: https://github.com/facebookresearch/demucs