feat: initial release (HT-Demucs FT, PyTorch, ready for HF Inference Endpoints)

Files changed (3)

README.md +229 -0
handler.py +110 -0
requirements.txt +5 -0

README.md ADDED

	@@ -0,0 +1,229 @@

+---
+language: en
+license: mit
+library_name: demucs
+pipeline_tag: audio-to-audio
+tags:
+  - stem-separation
+  - source-separation
+  - vocal-isolation
+  - vocal-remover
+  - music
+  - demucs
+  - htdemucs
+  - karaoke
+  - audio-to-audio
+datasets:
+  - StemSplitio/stem-separation-benchmark-2026
+inference: false
+---
+# HT-Demucs FT — Production-ready PyTorch model card
+The open-source stem separator with the **highest vocal SDR on MUSDB18-HQ in
+our benchmark** (9.19 dB median), packaged for Hugging Face Inference
+Endpoints with a ready-to-deploy `handler.py`. Use it for vocal removal,
+karaoke generation, acapella extraction, and any task that needs clean
+4-stem separation of music (`vocals`, `drums`, `bass`, `other`).
+This is the `htdemucs_ft` 4-bag ensemble by [Défossez et al. (Meta AI)][demucs-repo],
+repackaged with attribution. Original training and weights are unchanged;
+we add the deployment handler, the model card, and the benchmark context.
+> Need it as a REST API today, without standing up GPUs? Use the
+> [**StemSplit API**](https://stemsplit.io/developers) — same model, hosted
+> for you, with credits and a dashboard.
+---
+## Quality (independently benchmarked)
+Median SDR per stem on the standard MUSDB18-HQ test split (50 songs), BSS
+Eval v4 via `museval`. Higher is better. Source:
+[StemSplitio/stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026)
+v1.1.
+| Model | vocals | drums | bass | other |
+|---|---:|---:|---:|---:|
+| `htdemucs_ft` *(this card)* | **9.19** | 10.11 | 10.38 | 6.34 |
+| `mdx_extra_q` | 9.04 | **11.49** | **11.42** | **7.67** |
+| `htdemucs_6s` | 8.66 | 9.54 | 9.11 | 5.74 |
+| `htdemucs` | 8.53 | 10.01 | 9.78 | 6.42 |
+| `mdx_net_inst_hq3` *(vocals-only)* | 5.81 | — | — | — |
+**Pick this model when vocals are the priority**: it has the highest median
+vocal SDR of the models in the table above, although its margin over
+`mdx_extra_q` is only 0.15 dB. For drums/bass-focused work,
+consider `mdx_extra_q` instead.
+---
+## Quick start (Python)
+```python
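+import base64, io, os
+import requests  # client-side dependency, not part of requirements.txt
+import soundfile as sf
+# Replace with the URL of your own deployed endpoint.
+ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
+with open("your-song.mp3", "rb") as f:
+    audio_b64 = base64.b64encode(f.read()).decode()
+resp = requests.post(
+    ENDPOINT_URL,
+    headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
+    json={"inputs": audio_b64},
+)
+resp.raise_for_status()
+result = resp.json()  # parse the JSON body into a dict before indexing by stem
+for stem in ("vocals", "drums", "bass", "other"):
+    wav, sr = sf.read(io.BytesIO(base64.b64decode(result[stem])))
+    sf.write(f"out_{stem}.wav", wav, sr)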
+```
+Or run locally without Hugging Face at all:
+```python
+import torch, soundfile as sf
+from demucs.apply import apply_model
+from demucs.audio import convert_audio
+from demucs.pretrained import get_model
+model = get_model("htdemucs_ft").eval()
+wav, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)
+wav = torch.from_numpy(wav.T).contiguous()
+wav = convert_audio(wav, sr, model.samplerate, model.audio_channels).unsqueeze(0)
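+# Prefer CUDA, then Apple MPS, then CPU (same order as handler.py).
+device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
+with torch.no_grad():
+    stems = apply_model(model, wav, device=device)[0]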
+for i, name in enumerate(model.sources):  # ["drums", "bass", "other", "vocals"]
+    sf.write(f"out_{name}.wav", stems[i].T.numpy(), model.samplerate)
+```
+---
+## Deploy on Hugging Face Inference Endpoints
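+Click **Deploy → Inference Endpoints** above, pick a GPU instance, and HF
+will spin up a container running [`handler.py`](handler.py). Recommended
+hardware tiers, with latency estimated from the M4 Pro reference measurement
+(RTF, the real-time factor, is processing time divided by audio duration;
+lower is faster):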
+| Hardware | RTF | Latency for 3-min song |
+|---|---:|---:|
+| NVIDIA L4 | ~0.04 | ~7 s |
+| NVIDIA T4 small | ~0.10 | ~18 s |
+| CPU x4 (basic) | ~0.7 | ~125 s |
+Then call the endpoint:
+```bash
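+# Stream the JSON body over stdin: a base64-encoded song is too large for a
+# command-line argument, and GNU base64 line wraps must be stripped.
+(printf '{"inputs": "'; base64 < your-song.mp3 | tr -d '\n'; printf '"}') | \
+  curl -X POST https://<your-endpoint>.endpoints.huggingface.cloud \
+    -H "Authorization: Bearer $HF_TOKEN" \
+    -H "Content-Type: application/json" \
+    --data-binary @-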
+```
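+The response is a JSON object whose `vocals`, `drums`, `bass`, and `other`
+fields each hold a base64-encoded 16-bit PCM WAV at 44.1 kHz, alongside
+`sample_rate` and `duration_s`. If the request cannot be processed, the
+object contains a single `error` field instead.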
+---
+## Skip the infrastructure — use the StemSplit API
+If you'd rather not run your own endpoint, the
+[**StemSplit API**](https://stemsplit.io/developers) wraps this same model
+(and the rest of the [benchmarked lineup](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026))
+behind a hosted REST API with credits, a dashboard, and webhooks.
+```bash
+curl -X POST https://stemsplit.io/api/v1/jobs \
+  -H "Authorization: Bearer $STEMSPLIT_API_KEY" \
+  -F "audio=@your-song.mp3" \
+  -F "model=htdemucs_ft"
+```
+- 📘 [Developer docs](https://stemsplit.io/developers/docs)
+- 🔌 [API reference](https://stemsplit.io/developers/reference)
+- 📚 [Guides & recipes](https://stemsplit.io/developers/guides)
+Or try it in your browser, no code:
+- 🎤 [**Vocal Remover** (free online tool)](https://stemsplit.io/vocal-remover)
+  — upload a song, get an instrumental + isolated vocals
+- 🎶 [Karaoke Maker](https://stemsplit.io/karaoke-maker) —
+  same model, optimised for karaoke output
+- 🎙️ [Acapella Maker](https://stemsplit.io/acapella-maker) — extract clean
+  vocal acapellas for remixes and sampling
+- 📺 [YouTube Stem Splitter](https://stemsplit.io/youtube-stem-splitter) —
+  paste a YouTube URL, get the stems
+---
+## Performance
+Measured on an Apple M4 Pro (24 GB unified memory) with PyTorch 2.4 MPS, for
+the full 4-bag ensemble on 50 MUSDB18-HQ tracks (median track length ~4 min,
+RTF 0.26 ± 0.02). Cloud GPU numbers are extrapolated from public Demucs
+benchmarks.
+| Hardware | Per 3-min song | Peak RAM | Notes |
+|---|---:|---:|---|
+| Apple M4 Pro (MPS) | ~47 s | 3.1 GB | Measured in [our benchmark](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026) (RTF 0.26) |
+| NVIDIA L4 (CUDA) | ~7 s | 4 GB | Extrapolated |
+| NVIDIA T4 small (CUDA) | ~18 s | 4 GB | Extrapolated |
+| CPU (8-core) | ~125 s | 3 GB | Slow, but works for batch jobs |
+---
+## How `htdemucs_ft` differs from the other Demucs models
+| Variant | Bag size | Best at | When to choose |
+|---|---:|---|---|
+| `htdemucs_ft` *(this)* | 4 | **Vocals** | Karaoke, vocal isolation, acapella extraction |
+| `htdemucs` | 1 | Balanced | Lower latency / smaller deploy |
+| `htdemucs_6s` | 1 | 6-stem (adds piano, guitar) | When you need piano/guitar separately |
+| `mdx_extra_q` | 4 | **Drums, bass** | Music production where rhythm section is the priority |
+See the full
+[stem-separation benchmark dataset](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026)
+for SDR / ISR / SIR / SAR across all stems.
+---
+## Files in this repo
+- [`handler.py`](handler.py): the `EndpointHandler` class that HF Inference
+  Endpoints instantiates at startup and calls on each request. Accepts
+  base64 audio in, returns base64 stems out.
+- [`requirements.txt`](requirements.txt): Python deps (torch, torchaudio,
+  demucs, numpy, soundfile).
+- `README.md`: this card.
+Model weights are downloaded into the container's torch hub cache on first
+run (no `.pt` / `.th` files are stored in this repo to keep it small).
+---
+## License & attribution
+This repo is **MIT-licensed**, matching the original HT-Demucs.
+**Please cite the original authors** if you use this model in research:
+```bibtex
+@inproceedings{rouard2023hybrid,
+  title     = {Hybrid Transformers for Music Source Separation},
+  author    = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
+  booktitle = {ICASSP},
+  year      = {2023}
+}
+```
+And if you use the benchmark or this packaging:
+```bibtex
+@misc{stemsplit_benchmark_2026,
+  title  = {StemSplit Stem-Separation Benchmark 2026},
+  author = {StemSplit},
+  year   = {2026},
+  url    = {https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026}
+}
+```
+- Original model: [`facebookresearch/demucs`][demucs-repo]
+- Packaging by [StemSplit](https://stemsplit.io)
+- Benchmark dataset: [StemSplitio/stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026)
+[demucs-repo]: https://github.com/facebookresearch/demucs
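
The latency columns in the README tables follow from the RTF definition: latency is roughly RTF times the audio duration. The sketch below reproduces that arithmetic with the RTF values copied from the tables. `estimate_latency_s` and `RTF_BY_HARDWARE` are hypothetical names used only for illustration; they are not part of demucs or of this repo.

```python
# Sketch: latency = RTF x audio duration, where RTF = processing time / audio length.
# RTF values are copied from the README tables; the helper name is an assumption.

RTF_BY_HARDWARE = {
    "Apple M4 Pro (MPS)": 0.26,
    "NVIDIA L4": 0.04,
    "NVIDIA T4 small": 0.10,
    "CPU": 0.7,
}


def estimate_latency_s(rtf: float, audio_seconds: float) -> float:
    """Return the estimated wall-clock processing time in seconds."""
    if rtf < 0 or audio_seconds < 0:
        raise ValueError("rtf and audio_seconds must be non-negative")
    return rtf * audio_seconds


if __name__ == "__main__":
    three_minutes = 180.0
    for hardware, rtf in RTF_BY_HARDWARE.items():
        print(f"{hardware}: ~{estimate_latency_s(rtf, three_minutes):.0f} s")
```

For a 3-minute song this gives about 47 s, 7 s, and 18 s for the first three rows, matching the tables. The CPU row works out to 126 s, which the README states as ~125 s.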

handler.py ADDED

	@@ -0,0 +1,110 @@

+"""
+HF Inference Endpoint handler for HT-Demucs FT.
+When deployed to an HF Inference Endpoint, HF instantiates EndpointHandler
+once at container startup (downloading the demucs checkpoints into the
+container cache), then calls __call__ on every HTTP request.
+Request shape:
+    POST /
+    Content-Type: application/json
+    {
+      "inputs": "<base64-encoded audio bytes; any libsndfile-readable format>",
+      "parameters": {
+        "stems": ["vocals", "drums", "bass", "other"]  // optional, defaults to all 4
+      }
+    }
+Response shape:
+    {
+      "vocals":  "<base64 WAV>",
+      "drums":   "<base64 WAV>",
+      "bass":    "<base64 WAV>",
+      "other":   "<base64 WAV>",
+      "sample_rate": 44100,
+      "duration_s":  123.4
+    }
+To deploy:
+    1) Create the endpoint in the HF UI (Deploy -> Inference Endpoints on the
+       model card), choose a GPU instance (T4 small minimum; L4 recommended)
+    2) Send requests as shown above.
+Or skip self-hosting and use the StemSplit API:
+    https://stemsplit.io/developers
+"""
+from __future__ import annotations
+import base64
+import io
+from typing import Any
+import numpy as np
+import soundfile as sf
+import torch
+from demucs.apply import apply_model
+from demucs.audio import convert_audio
+from demucs.pretrained import get_model
+DEFAULT_STEMS = ("vocals", "drums", "bass", "other")
+def _audio_to_b64_wav(audio: torch.Tensor, sample_rate: int) -> str:
+    """Encode a (channels, samples) FP32 tensor as base64-PCM16 WAV."""
+    np_audio = audio.cpu().numpy().T  # -> (samples, channels)
+    np_audio = np.clip(np_audio, -1.0, 1.0)
+    buf = io.BytesIO()
+    sf.write(buf, np_audio, sample_rate, subtype="PCM_16", format="WAV")
+    return base64.b64encode(buf.getvalue()).decode("ascii")
+class EndpointHandler:
+    """HF Inference Endpoint entrypoint."""
+    def __init__(self, path: str = "") -> None:
+        self.model = get_model("htdemucs_ft")
+        self.model.eval()
+        self.device = torch.device(
+            "cuda" if torch.cuda.is_available() else
+            "mps" if torch.backends.mps.is_available() else
+            "cpu"
+        )
+        self.model.to(self.device)
+        self.sample_rate = int(self.model.samplerate)
+        self.audio_channels = int(self.model.audio_channels)
+        self.sources = list(self.model.sources)
+    def __call__(self, data: dict[str, Any]) -> dict[str, Any]:
+        if "inputs" not in data:
+            return {"error": "Request body must include base64 audio under 'inputs'."}
+        try:
+            # Decode inside the try so malformed base64 also returns an error payload
+            audio_bytes = base64.b64decode(data["inputs"])
+            wav_np, sr = sf.read(io.BytesIO(audio_bytes), dtype="float32", always_2d=True)
+        except Exception as e:  # noqa: BLE001
+            return {"error": f"Could not decode audio: {type(e).__name__}: {e}"}
+        # wav_np: (samples, channels) -> (channels, samples) FP32
+        wav = torch.from_numpy(wav_np.T).contiguous()
+        wav = convert_audio(wav, sr, self.sample_rate, self.audio_channels)
+        wav = wav.unsqueeze(0).to(self.device)  # (1, channels, samples)
+        # Optional stem filter
+        params = data.get("parameters", {}) or {}
+        requested_stems = [s for s in params.get("stems", DEFAULT_STEMS) if s in self.sources]
+        if not requested_stems:
+            requested_stems = list(self.sources)
+        with torch.no_grad():
+            # apply_model handles overlap-add segmentation internally
+            stems = apply_model(self.model, wav, device=str(self.device), progress=False)[0]
+            # stems: (n_sources, channels, samples) on `self.device`
+        out: dict[str, Any] = {
+            "sample_rate": self.sample_rate,
+            "duration_s": round(wav.shape[-1] / self.sample_rate, 3),
+        }
+        for stem in requested_stems:
+            idx = self.sources.index(stem)
+            out[stem] = _audio_to_b64_wav(stems[idx], self.sample_rate)
+        return out
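
A client-side counterpart to the request and response shapes in the docstring above might look like the following sketch. It uses only the standard library. The helper names `build_request`, `decode_response`, and `wav_info` are assumptions made for illustration and do not exist in this repo or in demucs.

```python
from __future__ import annotations

import base64
import io
import wave
from typing import Any

# Stem keys the handler may return; other keys (sample_rate, duration_s) are metadata.
STEM_NAMES = ("vocals", "drums", "bass", "other")


def build_request(audio_bytes: bytes, stems: list[str] | None = None) -> dict[str, Any]:
    """Build the JSON body: base64 audio under 'inputs', optional stem filter."""
    body: dict[str, Any] = {"inputs": base64.b64encode(audio_bytes).decode("ascii")}
    if stems is not None:
        body["parameters"] = {"stems": list(stems)}
    return body


def decode_response(response: dict[str, Any]) -> dict[str, bytes]:
    """Return {stem: raw WAV bytes}; raise if the handler reported an error."""
    if "error" in response:
        raise RuntimeError(response["error"])
    return {
        name: base64.b64decode(response[name])
        for name in STEM_NAMES
        if name in response
    }


def wav_info(wav_bytes: bytes) -> tuple[int, int, int]:
    """Return (sample_rate, channels, frames) of a PCM WAV."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getframerate(), w.getnchannels(), w.getnframes()
```

Because the handler returns `{"error": ...}` as a normal response body rather than raising, a client should check for that key before indexing by stem name, as `decode_response` does here.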

requirements.txt ADDED

	@@ -0,0 +1,5 @@

+torch>=2.2,<2.6
+torchaudio>=2.2,<2.6
+demucs==4.0.1
+numpy>=1.26,<2.0
+soundfile>=0.12