---
language: en
license: mit
library_name: demucs
pipeline_tag: audio-to-audio
tags:
- stem-separation
- source-separation
- vocal-isolation
- vocal-remover
- music
- demucs
- htdemucs
- karaoke
- audio-to-audio
datasets:
- StemSplitio/stem-separation-benchmark-2026
inference: false
---
# HT-Demucs FT: Production-ready PyTorch model card
The **highest-vocal-SDR open-source stem separator on MUSDB18-HQ** (9.19 dB
median), packaged for Hugging Face Inference Endpoints with a ready-to-deploy
`handler.py`. Use it for vocal removal, karaoke generation, acapella
extraction, and any task that needs clean 4-stem separation of music
(`vocals`, `drums`, `bass`, `other`).
This is the `htdemucs_ft` 4-bag ensemble by [Défossez et al. (Meta AI)][demucs-repo],
repackaged with attribution. Original training and weights are unchanged;
we add the deployment handler, the model card, and the benchmark context.
> Need it as a REST API today, without standing up GPUs? Use the
> [**StemSplit API**](https://stemsplit.io/developers): same model, hosted
> for you, with credits and a dashboard.
---
## Quality (independently benchmarked)
Median SDR per stem on the standard MUSDB18-HQ test split (50 songs), BSS
Eval v4 via `museval`. Higher is better. Source:
[StemSplitio/stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026)
v1.1.
| Model | vocals | drums | bass | other |
|---|---:|---:|---:|---:|
| `htdemucs_ft` *(this card)* | **9.19** | 10.11 | 10.38 | 6.34 |
| `mdx_extra_q` | 9.04 | **11.49** | **11.42** | **7.67** |
| `htdemucs_6s` | 8.66 | 9.54 | 9.11 | 5.74 |
| `htdemucs` | 8.53 | 10.01 | 9.78 | 6.42 |
| `mdx_net_inst_hq3` *(vocals-only)* | 5.81 | n/a | n/a | n/a |
**Pick this model when vocals are the priority**: it has the highest vocals
SDR of every separator in this benchmark on MUSDB18-HQ. For drums/bass-focused
work, consider `mdx_extra_q` instead.
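
To make the metric concrete, here is a minimal sketch of a plain signal-to-distortion ratio and the median aggregation used in the table. It is a simplification: the benchmark uses BSS Eval v4 via `museval`, which does more than this single ratio. The function names `plain_sdr_db` and `median_sdr_db` are illustrative, not part of `museval`.

```python
import numpy as np


def plain_sdr_db(reference, estimate, eps: float = 1e-12) -> float:
    """Plain SDR in dB: energy of the reference over energy of the error.

    This is NOT the full BSS Eval v4 procedure, only the basic ratio.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    # eps keeps the ratio finite for silent references or perfect estimates
    return float(10.0 * np.log10((signal_energy + eps) / (error_energy + eps)))


def median_sdr_db(per_track_sdr) -> float:
    """Aggregate per-track scores with the median, as the table above does."""
    return float(np.median(per_track_sdr))
```

An estimate that is uniformly 10% too quiet scores about 20 dB under this definition, which gives a feel for the scale of the numbers in the table.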
---
## Quick start (Python)
```python
import base64, io, json
import soundfile as sf
from huggingface_hub import InferenceClient

# Base64-encode the input file for the JSON payload
with open("your-song.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

client = InferenceClient(model="StemSplitio/htdemucs-ft-pytorch")
# `post` returns the raw response body as bytes, so parse the JSON explicitly
result = json.loads(client.post(json={"inputs": audio_b64}))

# Each stem comes back as a base64-encoded WAV
for stem in ("vocals", "drums", "bass", "other"):
    wav, sr = sf.read(io.BytesIO(base64.b64decode(result[stem])))
    sf.write(f"out_{stem}.wav", wav, sr)
```
Or run locally without Hugging Face at all:
```python
import torch, soundfile as sf
from demucs.apply import apply_model
from demucs.audio import convert_audio
from demucs.pretrained import get_model

model = get_model("htdemucs_ft").eval()

# soundfile returns (samples, channels); Demucs expects (channels, samples)
wav, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)
wav = torch.from_numpy(wav.T).contiguous()
wav = convert_audio(wav, sr, model.samplerate, model.audio_channels).unsqueeze(0)

device = "mps" if torch.backends.mps.is_available() else "cpu"
with torch.no_grad():
    stems = apply_model(model, wav, device=device)[0]

for i, name in enumerate(model.sources):  # ["drums", "bass", "other", "vocals"]
    sf.write(f"out_{name}.wav", stems[i].T.cpu().numpy(), model.samplerate)
```
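
For karaoke output, the instrumental can be built by summing the three non-vocal stems. The sketch below assumes a NumPy array shaped `(sources, channels, samples)` in the order given by `model.sources`; the helper name `mix_instrumental` is hypothetical, and clipping to `[-1, 1]` is one possible choice for handling overshoot, not something the original package prescribes.

```python
import numpy as np


def mix_instrumental(stems: np.ndarray, sources: list) -> np.ndarray:
    """Sum every stem except vocals.

    stems:   array shaped (num_sources, channels, samples)
    sources: stem names in the same order as the first axis of `stems`
    """
    keep = [i for i, name in enumerate(sources) if name != "vocals"]
    instrumental = stems[keep].sum(axis=0)
    # Summed stems can overshoot full scale; clip so the export stays in range
    return np.clip(instrumental, -1.0, 1.0)
```

With the local example above, this could be called as `mix_instrumental(stems.cpu().numpy(), model.sources)` and the result written with `sf.write` after transposing to `(samples, channels)`.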
---
## Deploy on Hugging Face Inference Endpoints
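Click **Deploy → Inference Endpoints** above, pick a GPU instance, and HF
will spin up a container running [`handler.py`](handler.py). Recommended
hardware tiers are listed below. RTF is the real-time factor (processing time
divided by audio duration), and the latencies are estimates based on the M4 Pro
reference measurement rather than measurements on each tier: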
| Hardware | RTF | Latency for 3-min song |
|---|---:|---:|
| NVIDIA L4 | ~0.04 | ~7 s |
| NVIDIA T4 small | ~0.10 | ~18 s |
| CPU x4 (basic) | ~0.7 | ~125 s |
Then call the endpoint:
```bash
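# Strip newlines from the base64 output (GNU base64 wraps lines) and stream the
# body over stdin so a large file does not exceed the shell's argument limit
{ printf '{"inputs": "'; base64 < your-song.mp3 | tr -d '\n'; printf '"}'; } |
  curl -X POST https://<your-endpoint>.endpoints.huggingface.cloud \
    -H "Authorization: Bearer $HF_TOKEN" \
    -H "Content-Type: application/json" \
    --data-binary @-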
```
The response is a JSON object whose keys `vocals`, `drums`, `bass`, and `other`
each hold a base64-encoded WAV at 44.1 kHz.
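
A minimal sketch of decoding that response on the client side, assuming it has already been parsed into a Python `dict`. The helper name `decode_stems` is hypothetical; only the four key names come from the response format described above.

```python
import base64

STEM_NAMES = ("vocals", "drums", "bass", "other")


def decode_stems(response: dict) -> dict:
    """Turn the endpoint's JSON object into {stem name: raw WAV bytes}."""
    missing = [name for name in STEM_NAMES if name not in response]
    if missing:
        raise KeyError(f"response is missing stems: {missing}")
    # Each value is a base64 string; decoding yields the WAV file contents
    return {name: base64.b64decode(response[name]) for name in STEM_NAMES}
```

Each value can then be written straight to disk, for example with `pathlib.Path(f"out_{name}.wav").write_bytes(data)`.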
---
## Skip the infrastructure: use the StemSplit API
If you'd rather not run your own endpoint, the
[**StemSplit API**](https://stemsplit.io/developers) wraps this same model
(and the rest of the [benchmarked lineup](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026))
behind a hosted REST API with credits, a dashboard, and webhooks.
```bash
curl -X POST https://stemsplit.io/api/v1/jobs \
-H "Authorization: Bearer $STEMSPLIT_API_KEY" \
-F "audio=@your-song.mp3" \
-F "model=htdemucs_ft"
```
- 📘 [Developer docs](https://stemsplit.io/developers/docs)
- 🔌 [API reference](https://stemsplit.io/developers/reference)
- 📚 [Guides & recipes](https://stemsplit.io/developers/guides)

Or try it in your browser, no code:
- 🎤 [**Vocal Remover** (free online tool)](https://stemsplit.io/vocal-remover):
  upload a song, get an instrumental + isolated vocals
- 🎶 [Karaoke Maker](https://stemsplit.io/karaoke-maker):
  same model, optimised for karaoke output
- 🎙️ [Acapella Maker](https://stemsplit.io/acapella-maker): extract clean
  vocal acapellas for remixes and sampling
- 📺 [YouTube Stem Splitter](https://stemsplit.io/youtube-stem-splitter):
  paste a YouTube URL, get the stems
---
## Performance
Measured on an Apple M4 Pro (24 GB unified memory) with PyTorch 2.4 MPS, for
the full 4-bag ensemble on 50 MUSDB18-HQ tracks (median track length ~4 min,
RTF 0.26 ± 0.02). Cloud GPU numbers are extrapolated from public Demucs
benchmarks.
| Hardware | Per 3-min song | Peak RAM | Notes |
|---|---:|---:|---|
| Apple M4 Pro (MPS) | ~47 s | 3.1 GB | Measured in [our benchmark](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026) (RTF 0.26) |
| NVIDIA L4 (CUDA) | ~7 s | 4 GB | Extrapolated |
| NVIDIA T4 small (CUDA) | ~18 s | 4 GB | Extrapolated |
| CPU (8-core) | ~125 s | 3 GB | Slow, but works for batch jobs |
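
The per-song figures follow from the real-time factor: RTF is processing time divided by audio duration, so latency is RTF times track length. A small sketch of that arithmetic (the function names are illustrative):

```python
def rtf_from_measurement(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: how long processing takes per second of audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def estimated_latency_s(rtf: float, audio_seconds: float) -> float:
    """Expected processing time for a track of the given length."""
    return rtf * audio_seconds


# A 3-minute song on the M4 Pro at the measured RTF of 0.26
m4_pro_latency = estimated_latency_s(0.26, 180)  # about 47 s
```

The same multiplication with an RTF of 0.04 or 0.10 gives the ~7 s and ~18 s figures listed for the L4 and T4.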
---
## How `htdemucs_ft` differs from the other Demucs models
| Variant | Bag size | Best at | When to choose |
|---|---:|---|---|
| `htdemucs_ft` *(this)* | 4 | **Vocals** | Karaoke, vocal isolation, acapella extraction |
| `htdemucs` | 1 | Balanced | Lower latency / smaller deploy |
| `htdemucs_6s` | 1 | 6-stem (adds piano, guitar) | When you need piano/guitar separately |
| `mdx_extra_q` | 4 | **Drums, bass** | Music production where rhythm section is the priority |
See the full
[stem-separation benchmark dataset](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026)
for SDR / ISR / SIR / SAR across all stems.
---
## Single-stem specialist variants (faster, smaller)
If you only need **one** stem in production, ship a specialist sub-model
instead of the full 4-bag ensemble. You get the same per-stem quality at
~160 MB instead of ~640 MB, and ~2.6× faster inference on M4 Pro MPS:
| Repo | Stem | Use cases |
|---|---|---|
| [`htdemucs-ft-drums-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-drums-pytorch) | drums | Drum extraction, beat transcription, sample-pack creation |
| [`htdemucs-ft-bass-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-bass-pytorch) | bass | Bassline transcription, mix rebalancing, sub-bass mastering |
| [`htdemucs-ft-other-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-other-pytorch) | other / instrumental | Karaoke instrumentals (pair with this vocals model), sample-flipping |
This repo (the full bag) remains the best choice when you need vocals plus
any other stem in a single request: it amortises the inference cost across
all 4 stems.
---
## Files in this repo
- [`handler.py`](handler.py): the `EndpointHandler` class that HF Inference
  Endpoints calls on each request. Accepts base64 audio in, returns base64 stems out.
- [`requirements.txt`](requirements.txt): Python deps (torch, demucs, soundfile).
- `README.md`: this card.
Model weights are downloaded into the container's torch hub cache on first
run (no `.pt` / `.th` files are stored in this repo to keep it small).
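
For orientation, here is a minimal sketch of the request/response contract described above. It is not the repo's actual `handler.py`: the separation step is replaced by an injected callable (`separate_fn`, a hypothetical parameter) so the base64 plumbing can be read in isolation. A real handler would load `htdemucs_ft` once in `__init__` and run it in `__call__`.

```python
import base64
from typing import Any, Callable, Dict, Optional

# Hypothetical type: takes encoded audio bytes, returns {stem name: WAV bytes}
SeparateFn = Callable[[bytes], Dict[str, bytes]]


class EndpointHandler:
    def __init__(self, path: str = "", separate_fn: Optional[SeparateFn] = None):
        # `path` is the model directory passed in by Inference Endpoints.
        # A real handler would load the model here, once per container.
        self.separate_fn = separate_fn

    def __call__(self, data: Dict[str, Any]) -> Dict[str, str]:
        if "inputs" not in data:
            raise ValueError('request body must contain an "inputs" field')
        if self.separate_fn is None:
            raise NotImplementedError("plug in a separation function")
        audio_bytes = base64.b64decode(data["inputs"])
        stems = self.separate_fn(audio_bytes)
        # Encode each stem's WAV bytes as base64 text for the JSON response
        return {name: base64.b64encode(wav).decode("ascii") for name, wav in stems.items()}
```

Injecting the separation function keeps the sketch runnable without GPU dependencies and makes the I/O contract easy to unit-test with a stub.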
---
## License & attribution
This repo is **MIT-licensed**, matching the original HT-Demucs.
**Please cite the original authors** if you use this model in research:
```bibtex
@inproceedings{rouard2023hybrid,
title = {Hybrid Transformers for Music Source Separation},
author = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
booktitle = {ICASSP},
year = {2023}
}
```
And if you use the benchmark or this packaging:
```bibtex
@misc{stemsplit_benchmark_2026,
title = {StemSplit Stem-Separation Benchmark 2026},
author = {StemSplit},
year = {2026},
url = {https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026}
}
```
- Original model: [`facebookresearch/demucs`][demucs-repo]
- Packaging by [StemSplit](https://stemsplit.io)
- Benchmark dataset: [StemSplitio/stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026)
[demucs-repo]: https://github.com/facebookresearch/demucs