htdemucs-onnx (YouStem build)
ONNX export of HTDemucs (Hybrid Transformer Demucs v4) for in-browser music source separation. Splits a track into 4 stems: drums, bass, other, vocals.
This build powers YouStem, a Chrome/Brave extension that
separates stems entirely on-device via WebGPU (onnxruntime-web). No audio ever
leaves the user's machine.
Provenance and license
This model was reconverted from the official MIT-licensed HTDemucs weights
published by Meta in facebookresearch/demucs,
using the demucs Python package (demucs.pretrained.get_model("htdemucs")).
- Architecture & weights: HTDemucs by Meta Platforms (first author: Simon Rouard), MIT License. See https://github.com/facebookresearch/demucs.
- This ONNX export: MIT License, © 2026 Ghilda.
The weights are numerically identical to the upstream MIT release; only the graph was exported to ONNX with the STFT/iSTFT externalised (see below).
What is different from a plain demucs export
The short-time Fourier transform (STFT) and its inverse are not part of this graph. They are computed in JavaScript by the host application. The model:
- takes the raw waveform and a pre-computed complex spectrogram as inputs;
- returns the two HTDemucs branches (frequency mask + time waveform) separately, so the application performs the iSTFT and the final sum.
This keeps the ONNX graph free of FFT operators (which are awkward in
onnxruntime-web) while remaining numerically equivalent to the reference model.
Inputs
| name | shape | dtype | meaning |
|---|---|---|---|
| mix | [1, 2, 343980] | float32 | raw stereo waveform, 44.1 kHz, 7.8 s segment |
| mag | [1, 4, 2048, 336] | float32 | complex STFT as channels: [L.real, L.imag, R.real, R.imag], un-normalised (the model normalises internally) |
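
The complex-as-channels layout of `mag` can be illustrated with a minimal sketch. It is written in Python with nested lists purely for readability; the host application does this in JavaScript, most likely on flat typed arrays. The function name `pack_complex_as_channels` and the type aliases are hypothetical, and the batch dimension is left out.

```python
from typing import List

Complex2D = List[List[complex]]   # [freq_bin][frame], one audio channel
Real3D = List[List[List[float]]]  # [plane][freq_bin][frame]


def pack_complex_as_channels(left: Complex2D, right: Complex2D) -> Real3D:
    """Return the planes [L.real, L.imag, R.real, R.imag].

    Hypothetical helper: each plane keeps the [freq_bin][frame] shape of
    its input, so a 2048 x 336 stereo STFT becomes 4 x 2048 x 336.
    """
    same_rows = len(left) == len(right)
    if not same_rows or any(len(a) != len(b) for a, b in zip(left, right)):
        raise ValueError("left and right spectrograms must have the same shape")

    planes: Real3D = []
    for spec in (left, right):
        # Real plane first, then imaginary plane, for each audio channel.
        planes.append([[z.real for z in row] for row in spec])
        planes.append([[z.imag for z in row] for row in spec])
    return planes


# Tiny 1 x 2 example instead of the real 2048 x 336 spectrogram.
packed = pack_complex_as_channels([[1 + 2j, 3 + 4j]], [[5 + 6j, 7 + 8j]])
print(len(packed), len(packed[0]), len(packed[0][0]))
```

The values are passed through unscaled, in line with the table's note that the model normalises internally.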
Outputs
| name | shape | dtype | meaning |
|---|---|---|---|
| freq | [1, 4, 4, 2048, 336] | float32 | frequency branch, complex-as-channels mask per source |
| time | [1, 4, 2, 343980] | float32 | time branch, waveform per source |
Source order: ['drums', 'bass', 'other', 'vocals'].
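
As a rough sketch of the final sum the application performs, the following Python outline turns each source's `freq` planes back into complex spectrograms, runs an iSTFT on them, and adds the `time` branch. Several things here are assumptions rather than documented behaviour: the `freq` planes are taken to use the same [L.real, L.imag, R.real, R.imag] order as the `mag` input, the `istft` callable is a hypothetical stand-in for the host's own implementation, and any padding, trimming or scaling the real pipeline needs is omitted. The batch dimension is dropped for brevity.

```python
from typing import Callable, Dict, List, Sequence

SOURCES = ["drums", "bass", "other", "vocals"]

Plane = List[List[float]]      # [freq_bin][frame]
Spec = List[List[complex]]     # [freq_bin][frame]
Wave = List[List[float]]       # [channel][sample]


def unpack_channels(planes: Sequence[Plane]) -> List[Spec]:
    """Turn [L.real, L.imag, R.real, R.imag] into [L, R] complex spectrograms."""
    if len(planes) != 4:
        raise ValueError("expected 4 planes (real/imag for left and right)")
    specs: List[Spec] = []
    for c in range(2):
        real, imag = planes[2 * c], planes[2 * c + 1]
        specs.append(
            [[complex(a, b) for a, b in zip(r_row, i_row)]
             for r_row, i_row in zip(real, imag)]
        )
    return specs


def combine_branches(
    freq: Sequence[Sequence[Plane]],        # [source][4][freq_bin][frame]
    time: Sequence[Wave],                   # [source][channel][sample]
    istft: Callable[[Spec], List[float]],   # hypothetical, supplied by the host
) -> Dict[str, Wave]:
    """Per source: stem = istft(frequency branch) + time branch."""
    stems: Dict[str, Wave] = {}
    for name, planes, wave in zip(SOURCES, freq, time):
        from_spec = [istft(spec) for spec in unpack_channels(planes)]
        stems[name] = [
            [a + b for a, b in zip(s, t)] for s, t in zip(from_spec, wave)
        ]
    return stems
```

Passing `istft` in as a parameter keeps the sketch independent of any particular FFT library, which matches the design described above: the transform lives in the host, not in the graph.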
Spectrogram parameters
sample_rate=44100, n_fft=4096, hop_length=1024, segment=7.8 s
(343980 samples), freq_bins=2048, frames=336.
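
The derived values above can be checked with a few lines of arithmetic. This sketch assumes that the frame count is the sample count divided by the hop length, rounded up, and that the Nyquist bin of the one-sided spectrum is dropped (n_fft / 2 bins rather than n_fft / 2 + 1). Both assumptions reproduce the listed numbers; the function names are illustrative only.

```python
import math

SAMPLE_RATE = 44100
N_FFT = 4096
HOP_LENGTH = 1024
SEGMENT_SECONDS = 7.8


def segment_samples(sample_rate: int = SAMPLE_RATE,
                    seconds: float = SEGMENT_SECONDS) -> int:
    # round() absorbs the floating-point error in 7.8 * 44100.
    return round(sample_rate * seconds)


def frame_count(samples: int, hop_length: int = HOP_LENGTH) -> int:
    # One frame per hop; a trailing partial hop still produces a frame.
    return math.ceil(samples / hop_length)


def freq_bin_count(n_fft: int = N_FFT) -> int:
    # One-sided spectrum without the Nyquist bin (assumption).
    return n_fft // 2


samples = segment_samples()
print(samples, frame_count(samples), freq_bin_count())  # 343980 336 2048
```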
Specs
- Opset 18, 100% standard ONNX operators (no custom domains).
- ~166 MB, float32.
Citation
@inproceedings{rouard2022hybrid,
title={Hybrid Transformers for Music Source Separation},
author={Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
booktitle={ICASSP 2023},
year={2023}
}