feat: initial release (HT-Demucs FT, PyTorch, ready for HF Inference Endpoints)
- README.md +229 -0
- handler.py +110 -0
- requirements.txt +5 -0
README.md
ADDED
@@ -0,0 +1,229 @@
---
language: en
license: mit
library_name: demucs
pipeline_tag: audio-to-audio
tags:
- stem-separation
- source-separation
- vocal-isolation
- vocal-remover
- music
- demucs
- htdemucs
- karaoke
- audio-to-audio
datasets:
- StemSplitio/stem-separation-benchmark-2026
inference: false
---

# HT-Demucs FT: production-ready PyTorch model card

The open-source stem separator with the **highest vocal SDR on MUSDB18-HQ**
in our benchmark (9.19 dB median), packaged for Hugging Face Inference
Endpoints with a ready-to-deploy `handler.py`. Use it for vocal removal,
karaoke generation, acapella extraction, and any task that needs clean
4-stem separation of music (`vocals`, `drums`, `bass`, `other`).

This is the `htdemucs_ft` 4-bag ensemble by [Défossez et al. (Meta AI)][demucs-repo],
repackaged with attribution. The original training and weights are unchanged;
we add the deployment handler, the model card, and the benchmark context.

> Need it as a REST API today, without standing up GPUs? Use the
> [**StemSplit API**](https://stemsplit.io/developers): same model, hosted
> for you, with credits and a dashboard.

---

## Quality (independently benchmarked)

Median SDR per stem, in dB, on the standard MUSDB18-HQ test split (50 songs),
computed with BSS Eval v4 via `museval`. Higher is better. Source:
[StemSplitio/stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026)
v1.1.

| Model | vocals | drums | bass | other |
|---|---:|---:|---:|---:|
| `htdemucs_ft` *(this card)* | **9.19** | 10.11 | 10.38 | 6.34 |
| `mdx_extra_q` | 9.04 | **11.49** | **11.42** | **7.67** |
| `htdemucs_6s` | 8.66 | 9.54 | 9.11 | 5.74 |
| `htdemucs` | 8.53 | 10.01 | 9.78 | 6.42 |
| `mdx_net_inst_hq3` *(vocals-only)* | 5.81 | — | — | — |

**Pick this model when vocals are the priority**: it has the highest vocal
SDR of the models benchmarked above. For drums- or bass-focused work,
consider `mdx_extra_q` instead.

---

## Quick start (Python)

```python
import base64, io, json, soundfile as sf
from huggingface_hub import InferenceClient

with open("your-song.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

# This repo sets `inference: false`, so point `model` at your deployed endpoint URL.
client = InferenceClient(model="https://<your-endpoint>.endpoints.huggingface.cloud")
# post() returns the raw response body as bytes, so parse it as JSON first.
result = json.loads(client.post(json={"inputs": audio_b64}))

for stem in ("vocals", "drums", "bass", "other"):
    wav, sr = sf.read(io.BytesIO(base64.b64decode(result[stem])))
    sf.write(f"out_{stem}.wav", wav, sr)
```

Or run locally without Hugging Face at all:

```python
import torch, soundfile as sf
from demucs.apply import apply_model
from demucs.audio import convert_audio
from demucs.pretrained import get_model

model = get_model("htdemucs_ft").eval()
wav, sr = sf.read("your-song.mp3", dtype="float32", always_2d=True)
wav = torch.from_numpy(wav.T).contiguous()
wav = convert_audio(wav, sr, model.samplerate, model.audio_channels).unsqueeze(0)

with torch.no_grad():
    stems = apply_model(model, wav, device="mps" if torch.backends.mps.is_available() else "cpu")[0]

for i, name in enumerate(model.sources):  # ["drums", "bass", "other", "vocals"]
    sf.write(f"out_{name}.wav", stems[i].T.numpy(), model.samplerate)
```

---

## Deploy on Hugging Face Inference Endpoints

Click **Deploy → Inference Endpoints** above, pick a GPU instance, and HF
will spin up a container running [`handler.py`](handler.py). Recommended
hardware tiers are listed below. RTF is the real-time factor (processing time
divided by audio duration). The GPU figures are extrapolated from public
Demucs benchmarks rather than measured on these instances; the measured
reference point is the M4 Pro run described under Performance.

| Hardware | RTF | Latency for 3-min song |
|---|---:|---:|
| NVIDIA L4 | ~0.04 | ~7 s |
| NVIDIA T4 small | ~0.10 | ~18 s |
| CPU x4 (basic) | ~0.7 | ~125 s |

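As a rough planning aid, the latency column can be reproduced from the RTF column, reading RTF as processing time divided by audio duration (which is how the figures in this card relate to each other). The helper name `estimated_latency_s` is illustrative, and the RTF values are the estimates from the table, not new measurements.

```python
def estimated_latency_s(rtf: float, audio_seconds: float) -> float:
    """Estimate processing time as real-time factor x audio duration."""
    if rtf < 0 or audio_seconds < 0:
        raise ValueError("rtf and audio_seconds must be non-negative")
    return rtf * audio_seconds


# RTF estimates copied from the table above (assumed, not re-measured).
RTF_BY_HARDWARE = {
    "NVIDIA L4": 0.04,
    "NVIDIA T4 small": 0.10,
    "CPU x4 (basic)": 0.7,
}

if __name__ == "__main__":
    for hardware, rtf in RTF_BY_HARDWARE.items():
        seconds = estimated_latency_s(rtf, 180)
        print(f"{hardware}: ~{seconds:.0f} s for a 3-minute song")
```

The same function gives a first guess for longer tracks, though real latency also includes upload time, base64 decoding, and cold-start model loading, none of which RTF captures.
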
Then call the endpoint:

```bash
# Stream the JSON body on stdin: a base64-encoded song is too large for a
# command-line argument, and line-wrapped base64 is not valid inside JSON.
{ printf '{"inputs": "'; base64 < your-song.mp3 | tr -d '\n'; printf '"}'; } |
  curl -X POST https://<your-endpoint>.endpoints.huggingface.cloud \
    -H "Authorization: Bearer $HF_TOKEN" \
    -H "Content-Type: application/json" \
    --data-binary @-
```

The response is a JSON object with base64-encoded 44.1 kHz WAVs (16-bit PCM)
under `vocals`, `drums`, `bass`, and `other`, plus `sample_rate` and
`duration_s` fields.

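A minimal client-side sketch of the request and response shapes, using only the standard library. The helper names `build_payload` and `decode_stems` are hypothetical; the field names (`inputs`, `parameters.stems`, `error`, and the per-stem keys) follow `handler.py` in this commit. Sending the payload is left to whichever HTTP client you prefer.

```python
from __future__ import annotations

import base64
import json

STEM_NAMES = ("vocals", "drums", "bass", "other")


def build_payload(audio_bytes: bytes, stems: list[str] | None = None) -> str:
    """Serialize raw audio bytes into the JSON body the handler expects."""
    payload: dict = {"inputs": base64.b64encode(audio_bytes).decode("ascii")}
    if stems is not None:
        # Optional filter; the handler falls back to all stems when omitted.
        payload["parameters"] = {"stems": stems}
    return json.dumps(payload)


def decode_stems(response: dict) -> dict[str, bytes]:
    """Return {stem name: WAV bytes} for every stem present in the response."""
    if "error" in response:
        raise RuntimeError(response["error"])
    return {
        name: base64.b64decode(response[name])
        for name in STEM_NAMES
        if name in response
    }
```

Each decoded value is a complete WAV file, so it can be written to disk as-is or opened with `soundfile` through an `io.BytesIO` wrapper.
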
---

## Skip the infrastructure: use the StemSplit API

If you'd rather not run your own endpoint, the
[**StemSplit API**](https://stemsplit.io/developers) wraps this same model
(and the rest of the [benchmarked lineup](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026))
behind a hosted REST API with credits, a dashboard, and webhooks.

```bash
curl -X POST https://stemsplit.io/api/v1/jobs \
  -H "Authorization: Bearer $STEMSPLIT_API_KEY" \
  -F "audio=@your-song.mp3" \
  -F "model=htdemucs_ft"
```

- 📘 [Developer docs](https://stemsplit.io/developers/docs)
- 🔌 [API reference](https://stemsplit.io/developers/reference)
- 📚 [Guides & recipes](https://stemsplit.io/developers/guides)

Or try it in your browser, no code:

- 🎤 [**Vocal Remover** (free online tool)](https://stemsplit.io/vocal-remover):
  upload a song, get an instrumental + isolated vocals
- 🎶 [Karaoke Maker](https://stemsplit.io/karaoke-maker):
  same model, optimised for karaoke output
- 🎙️ [Acapella Maker](https://stemsplit.io/acapella-maker): extract clean
  vocal acapellas for remixes and sampling
- 📺 [YouTube Stem Splitter](https://stemsplit.io/youtube-stem-splitter):
  paste a YouTube URL, get the stems

---

## Performance

Measured on an Apple M4 Pro (24 GB unified memory) with PyTorch 2.4 MPS, for
the full 4-bag ensemble on 50 MUSDB18-HQ tracks (median track length ~4 min,
RTF 0.26 ± 0.02). Cloud GPU numbers are extrapolated from public Demucs
benchmarks.

| Hardware | Per 3-min song | Peak RAM | Notes |
|---|---:|---:|---|
| Apple M4 Pro (MPS) | ~47 s | 3.1 GB | Measured in [our benchmark](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026) (RTF 0.26) |
| NVIDIA L4 (CUDA) | ~7 s | 4 GB | Extrapolated |
| NVIDIA T4 small (CUDA) | ~18 s | 4 GB | Extrapolated |
| CPU (8-core) | ~125 s | 3 GB | Slow, but works for batch jobs |

---

## How `htdemucs_ft` differs from the other Demucs models

| Variant | Bag size | Best at | When to choose |
|---|---:|---|---|
| `htdemucs_ft` *(this)* | 4 | **Vocals** | Karaoke, vocal isolation, acapella extraction |
| `htdemucs` | 1 | Balanced | Lower latency / smaller deploy |
| `htdemucs_6s` | 1 | 6-stem (adds piano, guitar) | When you need piano/guitar separately |
| `mdx_extra_q` | 4 | **Drums, bass** | Music production where rhythm section is the priority |

See the full
[stem-separation benchmark dataset](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026)
for SDR / ISR / SIR / SAR across all stems.

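When the model is picked programmatically, the comparison table can be folded into a small lookup. This is only a sketch: `choose_model` and the priority labels (`"vocals"`, `"balanced"`, `"six_stems"`, `"rhythm"`) are made up for illustration, while the model names are the ones listed in the table.

```python
# Priority labels are hypothetical; model names come from the table above.
MODEL_BY_PRIORITY = {
    "vocals": "htdemucs_ft",     # karaoke, vocal isolation, acapella extraction
    "balanced": "htdemucs",      # lower latency, smaller deploy
    "six_stems": "htdemucs_6s",  # adds piano and guitar stems
    "rhythm": "mdx_extra_q",     # drums and bass are the priority
}


def choose_model(priority: str) -> str:
    """Map a use-case label to the model name recommended in the table."""
    try:
        return MODEL_BY_PRIORITY[priority]
    except KeyError:
        options = ", ".join(sorted(MODEL_BY_PRIORITY))
        raise ValueError(f"unknown priority {priority!r}; expected one of: {options}") from None
```

The returned name can then be passed wherever this card uses `"htdemucs_ft"`, for example as the argument to `get_model`.
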
---

## Files in this repo

- [`handler.py`](handler.py): the `EndpointHandler` class that HF Inference
  Endpoints instantiates at startup and calls on each request. Accepts base64
  audio in, returns base64 stems out.
- [`requirements.txt`](requirements.txt): Python deps (torch, torchaudio, demucs, numpy, soundfile).
- `README.md`: this card.

Model weights are downloaded into the container's torch hub cache on first
run (no `.pt` / `.th` files are stored in this repo to keep it small).

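To check where those weights end up, or to pre-populate the cache during an image build, it helps to know the directory. The sketch below mirrors PyTorch's default rule for the hub cache (`$TORCH_HOME/hub`, where `TORCH_HOME` defaults to `$XDG_CACHE_HOME/torch` and `XDG_CACHE_HOME` defaults to `~/.cache`), with URL-fetched files stored in a `checkpoints` subdirectory. That Demucs fetches its weights through this mechanism is an assumption based on the sentence above, and `hub_checkpoint_dir` is a hypothetical helper; at runtime, `torch.hub.get_dir()` is the authoritative source for the hub directory.

```python
import os


def hub_checkpoint_dir(env=None) -> str:
    """Resolve the default torch hub checkpoint directory from environment variables."""
    env = os.environ if env is None else env
    cache_home = env.get("XDG_CACHE_HOME", "~/.cache")
    torch_home = env.get("TORCH_HOME", os.path.join(cache_home, "torch"))
    return os.path.join(os.path.expanduser(torch_home), "hub", "checkpoints")


if __name__ == "__main__":
    print(hub_checkpoint_dir())
```

Setting `TORCH_HOME` to a path on a persistent volume is one way to avoid re-downloading the checkpoints on every cold start, assuming the endpoint's container allows it.
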
---

## License & attribution

This repo is **MIT-licensed**, matching the original HT-Demucs.

**Please cite the original authors** if you use this model in research:

```bibtex
@inproceedings{rouard2023hybrid,
  title     = {Hybrid Transformers for Music Source Separation},
  author    = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
  booktitle = {ICASSP},
  year      = {2023}
}
```

And if you use the benchmark or this packaging:

```bibtex
@misc{stemsplit_benchmark_2026,
  title  = {StemSplit Stem-Separation Benchmark 2026},
  author = {StemSplit},
  year   = {2026},
  url    = {https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026}
}
```

- Original model: [`facebookresearch/demucs`][demucs-repo]
- Packaging by [StemSplit](https://stemsplit.io)
- Benchmark dataset: [StemSplitio/stem-separation-benchmark-2026](https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026)

[demucs-repo]: https://github.com/facebookresearch/demucs
handler.py
ADDED
@@ -0,0 +1,110 @@
"""
HF Inference Endpoint handler for HT-Demucs FT.

When deployed to an HF Inference Endpoint, HF instantiates EndpointHandler
once at container startup (downloading the demucs checkpoints into the
container cache), then calls __call__ on every HTTP request.

Request shape:
    POST /
    Content-Type: application/json
    {
      "inputs": "<base64-encoded audio bytes; any libsndfile-readable format>",
      "parameters": {
        "stems": ["vocals", "drums", "bass", "other"]   // optional, defaults to all 4
      }
    }

Response shape:
    {
      "vocals": "<base64 WAV>",
      "drums":  "<base64 WAV>",
      "bass":   "<base64 WAV>",
      "other":  "<base64 WAV>",
      "sample_rate": 44100,
      "duration_s": 123.4
    }

To deploy:
    1) Create the endpoint in the HF UI (Deploy -> Inference Endpoints on the
       model card), choose a GPU instance (T4 small minimum; L4 recommended)
    2) Send requests as shown above.

Or skip self-hosting and use the StemSplit API:
    https://stemsplit.io/developers
"""
from __future__ import annotations

import base64
import io
from typing import Any

import numpy as np
import soundfile as sf
import torch
from demucs.apply import apply_model
from demucs.audio import convert_audio
from demucs.pretrained import get_model

DEFAULT_STEMS = ("vocals", "drums", "bass", "other")


def _audio_to_b64_wav(audio: torch.Tensor, sample_rate: int) -> str:
    """Encode a (channels, samples) FP32 tensor as base64-PCM16 WAV."""
    np_audio = audio.cpu().numpy().T  # -> (samples, channels)
    np_audio = np.clip(np_audio, -1.0, 1.0)
    buf = io.BytesIO()
    sf.write(buf, np_audio, sample_rate, subtype="PCM_16", format="WAV")
    return base64.b64encode(buf.getvalue()).decode("ascii")


class EndpointHandler:
    """HF Inference Endpoint entrypoint."""

    def __init__(self, path: str = "") -> None:
        self.model = get_model("htdemucs_ft")
        self.model.eval()
        self.device = torch.device(
            "cuda" if torch.cuda.is_available() else
            "mps" if torch.backends.mps.is_available() else
            "cpu"
        )
        self.model.to(self.device)
        self.sample_rate = int(self.model.samplerate)
        self.audio_channels = int(self.model.audio_channels)
        self.sources = list(self.model.sources)

    def __call__(self, data: dict[str, Any]) -> dict[str, Any]:
        if "inputs" not in data:
            return {"error": "Request body must include base64 audio under 'inputs'."}

        try:
            # Decode inside the try block so malformed base64 is reported as
            # an error response instead of raising out of the handler.
            audio_bytes = base64.b64decode(data["inputs"])
            wav_np, sr = sf.read(io.BytesIO(audio_bytes), dtype="float32", always_2d=True)
        except Exception as e:  # noqa: BLE001
            return {"error": f"Could not decode audio: {type(e).__name__}: {e}"}

        # wav_np: (samples, channels) -> (channels, samples) FP32
        wav = torch.from_numpy(wav_np.T).contiguous()
        wav = convert_audio(wav, sr, self.sample_rate, self.audio_channels)
        wav = wav.unsqueeze(0).to(self.device)  # (1, channels, samples)

        # Optional stem filter; a missing or null "stems" means all stems,
        # and unknown stem names are ignored.
        params = data.get("parameters", {}) or {}
        requested_stems = [s for s in (params.get("stems") or DEFAULT_STEMS) if s in self.sources]
        if not requested_stems:
            requested_stems = list(self.sources)

        with torch.no_grad():
            # apply_model handles overlap-add segmentation internally
            stems = apply_model(self.model, wav, device=str(self.device), progress=False)[0]
            # stems: (n_sources, channels, samples) on `self.device`

        out: dict[str, Any] = {
            "sample_rate": self.sample_rate,
            "duration_s": round(wav.shape[-1] / self.sample_rate, 3),
        }
        for stem in requested_stems:
            idx = self.sources.index(stem)
            out[stem] = _audio_to_b64_wav(stems[idx], self.sample_rate)
        return out
requirements.txt
ADDED
@@ -0,0 +1,5 @@
torch>=2.2,<2.6
torchaudio>=2.2,<2.6
demucs==4.0.1
numpy>=1.26,<2.0
soundfile>=0.12