---
language: en
license: mit
library_name: demucs
pipeline_tag: audio-to-audio
tags:
- stem-separation
- source-separation
- vocal-isolation
- vocal-remover
- music
- demucs
- htdemucs
- karaoke
- audio-to-audio
datasets:
- StemSplitio/stem-separation-benchmark-2026
inference: false
---

# HT-Demucs FT: production-ready PyTorch model card

The **highest-vocal-SDR open-source stem separator on MUSDB18-HQ** among the
models in our benchmark (9.19 dB median), packaged for Hugging Face Inference
Endpoints with a ready-to-deploy `handler.py`. Use it for vocal removal,
karaoke generation, acapella extraction, and any task that needs clean 4-stem
separation of music (`vocals`, `drums`, `bass`, `other`).

This is the `htdemucs_ft` 4-bag ensemble by [Défossez et al. (Meta AI)][demucs-repo],
repackaged with attribution. Original training and weights are unchanged;
we add the deployment handler, the model card, and the benchmark context.

> Need it as a REST API today, without standing up GPUs? Use the
> [**StemSplit API**](https://stemsplit.io/developers): same model, hosted
> for you, with credits and a dashboard.

---

## Quality (independently benchmarked)

Median SDR in dB per stem on the standard MUSDB18-HQ test split (50 songs), BSS
Eval v4 via `museval`. Higher is better. Source:
[StemSplitio/stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026)
v1.1.

| Model | vocals | drums | bass | other |
|---|---:|---:|---:|---:|
| `htdemucs_ft` *(this card)* | **9.19** | 10.11 | 10.38 | 6.34 |
| `mdx_extra_q` | 9.04 | **11.49** | **11.42** | **7.67** |
| `htdemucs_6s` | 8.66 | 9.54 | 9.11 | 5.74 |
| `htdemucs` | 8.53 | 10.01 | 9.78 | 6.42 |
| `mdx_net_inst_hq3` *(vocals-only)* | 5.81 | n/a | n/a | n/a |

**Pick this model when vocals are the priority**: it posts the highest
MUSDB18-HQ vocal SDR of the open-source separators benchmarked above. For
drums/bass-focused work, consider `mdx_extra_q` instead.

---

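## Quick start (Python)

```python
import base64
import io
import json

import soundfile as sf
from huggingface_hub import InferenceClient

# Base64-encode the input file for the JSON payload.
with open("your-song.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

# `model` may also be the URL of your deployed Inference Endpoint.
client = InferenceClient(model="StemSplitio/htdemucs-ft-pytorch")
# post() returns raw response bytes, so parse them as JSON before indexing.
# Newer huggingface_hub releases deprecate post(); if it is missing in your
# version, send the same JSON to the endpoint URL with any HTTP client.
result = json.loads(client.post(json={"inputs": audio_b64}))

# Each stem comes back as a base64-encoded WAV.
for stem in ("vocals", "drums", "bass", "other"):
    wav, sr = sf.read(io.BytesIO(base64.b64decode(result[stem])))
    sf.write(f"out_{stem}.wav", wav, sr)
```
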
Or run locally without Hugging Face at all:

```python
import soundfile as sf
import torch
from demucs.apply import apply_model
from demucs.audio import convert_audio
from demucs.pretrained import get_model

model = get_model("htdemucs_ft").eval()

# soundfile returns (samples, channels); Demucs expects (channels, samples).
# Reading MP3 needs a libsndfile build with MP3 support (1.1.0 or newer).
wav, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)
wav = torch.from_numpy(wav.T).contiguous()
wav = convert_audio(wav, sr, model.samplerate, model.audio_channels).unsqueeze(0)

# Prefer CUDA, then Apple MPS, then CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

with torch.no_grad():
    stems = apply_model(model, wav, device=device)[0]

for i, name in enumerate(model.sources):  # ["drums", "bass", "other", "vocals"]
    sf.write(f"out_{name}.wav", stems[i].T.cpu().numpy(), model.samplerate)
```

---

## Deploy on Hugging Face Inference Endpoints

Click **Deploy → Inference Endpoints** above, pick a GPU instance, and HF
will spin up a container running [`handler.py`](handler.py). Recommended
hardware tiers are listed below. The figures are estimates relative to the
M4 Pro reference measurement, not measurements on these instances (see
Performance below). RTF, the real-time factor, is processing time divided by
audio duration:

| Hardware | RTF | Latency for 3-min song |
|---|---:|---:|
| NVIDIA L4 | ~0.04 | ~7 s |
| NVIDIA T4 small | ~0.10 | ~18 s |
| CPU x4 (basic) | ~0.7 | ~125 s |

Then call the endpoint:

```bash
# Build the JSON body on stdin: tr removes the line wraps that GNU base64 adds,
# and --data-binary @- avoids the shell argument-length limit on large files.
{ printf '{"inputs": "'; base64 < your-song.mp3 | tr -d '\n'; printf '"}'; } |
  curl -X POST https://<your-endpoint>.endpoints.huggingface.cloud \
    -H "Authorization: Bearer $HF_TOKEN" \
    -H "Content-Type: application/json" \
    --data-binary @-
```

The response is a JSON object whose `vocals`, `drums`, `bass`, and `other`
keys hold base64-encoded WAVs at 44.1 kHz.

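As a rough illustration of that response format, the sketch below decodes each stem and checks that it starts with a RIFF/WAVE header before you write it to disk. The helper name `decode_stems` and its `response_text` argument are hypothetical; only the four keys and the base64-encoded WAV payloads come from the description above, and the actual handler may add fields this sketch ignores.

```python
import base64
import json
from typing import Dict

STEMS = ("vocals", "drums", "bass", "other")


def decode_stems(response_text: str) -> Dict[str, bytes]:
    """Decode the endpoint's JSON response into raw WAV bytes per stem.

    Hypothetical helper: assumes the body is a JSON object holding one
    base64 string per stem, as described in this card.
    """
    payload = json.loads(response_text)
    missing = [name for name in STEMS if name not in payload]
    if missing:
        raise KeyError(f"response is missing stems: {missing}")

    decoded = {}
    for name in STEMS:
        wav_bytes = base64.b64decode(payload[name])
        # A WAV file starts with "RIFF", a 4-byte size field, then "WAVE".
        if wav_bytes[:4] != b"RIFF" or wav_bytes[8:12] != b"WAVE":
            raise ValueError(f"{name!r} does not look like a WAV file")
        decoded[name] = wav_bytes
    return decoded


# Usage (response_text is the body returned by the curl call above):
#   for name, data in decode_stems(response_text).items():
#       with open(f"out_{name}.wav", "wb") as f:
#           f.write(data)
```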

---

## Skip the infrastructure: use the StemSplit API

If you'd rather not run your own endpoint, the
[**StemSplit API**](https://stemsplit.io/developers) wraps this same model
(and the rest of the [benchmarked lineup](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026))
behind a hosted REST API with credits, a dashboard, and webhooks.

```bash
curl -X POST https://stemsplit.io/api/v1/jobs \
  -H "Authorization: Bearer $STEMSPLIT_API_KEY" \
  -F "audio=@your-song.mp3" \
  -F "model=htdemucs_ft"
```

- [Developer docs](https://stemsplit.io/developers/docs)
- [API reference](https://stemsplit.io/developers/reference)
- [Guides & recipes](https://stemsplit.io/developers/guides)

Or try it in your browser, no code:

- [**Vocal Remover** (free online tool)](https://stemsplit.io/vocal-remover):
  upload a song, get an instrumental + isolated vocals
- [Karaoke Maker](https://stemsplit.io/karaoke-maker):
  same model, optimised for karaoke output
- [Acapella Maker](https://stemsplit.io/acapella-maker): extract clean
  vocal acapellas for remixes and sampling
- [YouTube Stem Splitter](https://stemsplit.io/youtube-stem-splitter):
  paste a YouTube URL, get the stems

---

## Performance

Measured on an Apple M4 Pro (24 GB unified memory) with PyTorch 2.4 MPS, for
the full 4-bag ensemble on 50 MUSDB18-HQ tracks (median track length ~4 min,
RTF 0.26 ± 0.02). Cloud GPU numbers are extrapolated from public Demucs
benchmarks.

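The per-song latencies quoted in this card appear to follow from the real-time factor: processing time is roughly RTF multiplied by audio duration. The sketch below reproduces that arithmetic for a 3-minute song. `estimate_latency_s` is a hypothetical helper that assumes latency scales linearly with duration, and the RTF values for hardware other than the M4 Pro are the card's extrapolations, not measurements.

```python
def estimate_latency_s(rtf: float, duration_s: float) -> float:
    """Estimate processing time, assuming it scales linearly with duration."""
    if rtf < 0 or duration_s < 0:
        raise ValueError("rtf and duration_s must be non-negative")
    return rtf * duration_s


# RTF values quoted in this card; the cloud GPU ones are extrapolated.
RTF_BY_HARDWARE = {
    "Apple M4 Pro (MPS)": 0.26,
    "NVIDIA L4": 0.04,
    "NVIDIA T4 small": 0.10,
}

for hardware, rtf in RTF_BY_HARDWARE.items():
    seconds = estimate_latency_s(rtf, duration_s=180)  # 3-minute song
    print(f"{hardware}: ~{seconds:.0f} s")
```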

| Hardware | Per 3-min song | Peak RAM | Notes |
|---|---:|---:|---|
| Apple M4 Pro (MPS) | ~47 s | 3.1 GB | Measured in [our benchmark](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026) (RTF 0.26) |
| NVIDIA L4 (CUDA) | ~7 s | 4 GB | Extrapolated |
| NVIDIA T4 small (CUDA) | ~18 s | 4 GB | Extrapolated |
| CPU (8-core) | ~125 s | 3 GB | Slow, but works for batch jobs |

---

## How `htdemucs_ft` differs from the other Demucs models

| Variant | Bag size | Best at | When to choose |
|---|---:|---|---|
| `htdemucs_ft` *(this)* | 4 | **Vocals** | Karaoke, vocal isolation, acapella extraction |
| `htdemucs` | 1 | Balanced | Lower latency / smaller deploy |
| `htdemucs_6s` | 1 | 6-stem (adds piano, guitar) | When you need piano/guitar separately |
| `mdx_extra_q` | 4 | **Drums, bass** | Music production where rhythm section is the priority |

See the full
[stem-separation benchmark dataset](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026)
for SDR / ISR / SIR / SAR across all stems.

---

## Single-stem specialist variants (faster, smaller)

If you only need **one** stem in production, ship a specialist sub-model
instead of the full 4-bag ensemble. Same per-stem quality, ~160 MB instead
of ~640 MB, ~2.6× faster on M4 Pro MPS:

| Repo | Stem | Use cases |
|---|---|---|
| [`htdemucs-ft-drums-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-drums-pytorch) | drums | Drum extraction, beat transcription, sample-pack creation |
| [`htdemucs-ft-bass-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-bass-pytorch) | bass | Bassline transcription, mix rebalancing, sub-bass mastering |
| [`htdemucs-ft-other-pytorch`](https://huggingface.co/StemSplitio/htdemucs-ft-other-pytorch) | other / instrumental | Karaoke instrumentals (pair with this vocals model), sample-flipping |

This repo (the full bag) remains the best choice when you need vocals plus
any other stem in a single request, because it amortises the inference cost
across all 4 stems.

---

## Files in this repo

- [`handler.py`](handler.py): the `EndpointHandler` class that HF Inference
  Endpoints calls on each request. Accepts base64 audio in, returns base64 stems out.
- [`requirements.txt`](requirements.txt): Python deps (torch, demucs, soundfile).
- `README.md`: this card.

Model weights are downloaded into the container's torch hub cache on first
run (no `.pt` / `.th` files are stored in this repo to keep it small).

---

## License & attribution

This repo is **MIT-licensed**, matching the original HT-Demucs.

**Please cite the original authors** if you use this model in research:

```bibtex
@inproceedings{rouard2023hybrid,
  title     = {Hybrid Transformers for Music Source Separation},
  author    = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
  booktitle = {ICASSP},
  year      = {2023}
}
```

And if you use the benchmark or this packaging:

```bibtex
@misc{stemsplit_benchmark_2026,
  title  = {StemSplit Stem-Separation Benchmark 2026},
  author = {StemSplit},
  year   = {2026},
  url    = {https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026}
}
```

- Original model: [`facebookresearch/demucs`][demucs-repo]
- Packaging by [StemSplit](https://stemsplit.io)
- Benchmark dataset: [StemSplitio/stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026)

[demucs-repo]: https://github.com/facebookresearch/demucs