htdemucs-onnx (YouStem build)

ONNX export of HTDemucs (Hybrid Transformer Demucs v4) for in-browser music source separation. Splits a track into 4 stems: drums, bass, other, vocals.

This build powers YouStem, a Chrome/Brave extension that separates stems entirely on-device via WebGPU (onnxruntime-web). No audio ever leaves the user's machine.

Provenance and license

This model was reconverted from the official MIT-licensed HTDemucs weights published by Meta in facebookresearch/demucs, using the demucs Python package (demucs.pretrained.get_model("htdemucs")).

Architecture & weights: HTDemucs by Meta Platforms (first author: Simon Rouard), MIT License. See https://github.com/facebookresearch/demucs.
This ONNX export: MIT License, © 2026 Ghilda.

The weights are numerically identical to the upstream MIT release; only the graph was exported to ONNX with the STFT/iSTFT externalised (see below).

What is different from a plain demucs export

The short-time Fourier transform (STFT) and its inverse are not part of this graph. They are computed in JavaScript by the host application. The model:

takes the raw waveform and a pre-computed complex spectrogram as inputs;
returns the two HTDemucs branches (frequency mask + time waveform) separately, so the application performs the iSTFT and the final sum.

This keeps the ONNX graph free of FFT operators (which are awkward in onnxruntime-web) while remaining numerically equivalent to the reference model.
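The recombination step described above can be sketched as follows. This is a minimal Python illustration of the logic only, not the extension's JavaScript code: `combine_branches` and the `istft` callable are hypothetical names, the inverse STFT is assumed to be supplied by the host, and each waveform is a single channel for brevity.

```python
from typing import Callable, List, Sequence

Waveform = List[float]


def combine_branches(
    freq_per_source: Sequence[object],
    time_per_source: Sequence[Waveform],
    istft: Callable[[object], Waveform],
) -> List[Waveform]:
    """Sum the inverted frequency branch and the time branch for each source."""
    if len(freq_per_source) != len(time_per_source):
        raise ValueError("both branches must cover the same number of sources")
    stems: List[Waveform] = []
    for spec, wave in zip(freq_per_source, time_per_source):
        from_freq = istft(spec)  # host-side inverse STFT (hypothetical callable)
        if len(from_freq) != len(wave):
            raise ValueError("iSTFT output must match the time-branch length")
        # Final stem = iSTFT(frequency branch) + time branch, sample by sample.
        stems.append([a + b for a, b in zip(from_freq, wave)])
    return stems
```

A stereo host would apply the same sum to each channel of each source.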

Inputs

name	shape	dtype	meaning
`mix`	`[1, 2, 343980]`	float32	raw stereo waveform, 44.1 kHz, 7.8 s segment
`mag`	`[1, 4, 2048, 336]`	float32	complex STFT as channels: `[L.real, L.imag, R.real, R.imag]`, un-normalised (the model normalises internally)
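To illustrate the complex-as-channels layout of `mag`, here is a small Python sketch that flattens a left and a right complex spectrogram into one row-major buffer in the order `[L.real, L.imag, R.real, R.imag]`. The function names are hypothetical, and the row-major flattening is an assumption about how a host would fill a tensor buffer.

```python
from typing import List, Sequence

Spectrogram = Sequence[Sequence[complex]]  # indexed [freq_bin][frame]


def pack_complex_as_channels(left: Spectrogram, right: Spectrogram) -> List[float]:
    """Flatten two complex spectrograms into a [1, 4, bins, frames] buffer."""
    bins, frames = len(left), len(left[0])
    rows = list(left) + list(right)
    if len(right) != bins or any(len(row) != frames for row in rows):
        raise ValueError("left and right must share one [bins][frames] shape")
    flat: List[float] = []
    # Channel order: L.real, L.imag, R.real, R.imag.
    for spec, part in ((left, "real"), (left, "imag"), (right, "real"), (right, "imag")):
        for row in spec:
            flat.extend(getattr(z, part) for z in row)
    return flat


def flat_index(channel: int, freq_bin: int, frame: int,
               bins: int = 2048, frames: int = 336) -> int:
    """Row-major offset of one element in the [1, 4, bins, frames] buffer."""
    return (channel * bins + freq_bin) * frames + frame
```

With the shape listed above, the buffer holds 4 × 2048 × 336 = 2,752,512 floats.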

Outputs

name	shape	dtype	meaning
`freq`	`[1, 4, 4, 2048, 336]`	float32	frequency branch, complex-as-channels mask per source
`time`	`[1, 4, 2, 343980]`	float32	time branch, waveform per source

Source order: ['drums', 'bass', 'other', 'vocals'].
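As a sketch of how the source order maps onto the `time` output, the Python snippet below slices one stem out of a flat, row-major `[1, 4, 2, 343980]` buffer. `stem_from_time_output` is a hypothetical helper, and the row-major layout is an assumption about how the host reads the tensor data.

```python
from typing import List, Sequence

SOURCES = ["drums", "bass", "other", "vocals"]
CHANNELS = 2
SAMPLES = 343980


def stem_from_time_output(flat: Sequence[float], name: str,
                          channels: int = CHANNELS,
                          samples: int = SAMPLES) -> List[List[float]]:
    """Return [channel][sample] for one source from a flat `time` buffer."""
    if len(flat) != len(SOURCES) * channels * samples:
        raise ValueError("buffer length does not match [1, 4, channels, samples]")
    base = SOURCES.index(name) * channels * samples  # start of this source
    return [
        list(flat[base + c * samples: base + (c + 1) * samples])
        for c in range(channels)
    ]
```

The same offset arithmetic applies to `freq`, with `channels * samples` replaced by 4 × 2048 × 336 elements per source.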

Spectrogram parameters

sample_rate=44100, n_fft=4096, hop_length=1024, segment=7.8 s (343980 samples), freq_bins=2048, frames=336.
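The derived values follow from the base parameters, as this short Python check shows. It assumes that the frame count is the segment length divided by the hop length, rounded up, and that the host STFT keeps `n_fft / 2` bins, one fewer than the `n_fft / 2 + 1` bins of a one-sided FFT. Upstream demucs drops the highest bin, and the host is assumed to do the same.

```python
import math

SAMPLE_RATE = 44100
N_FFT = 4096
HOP_LENGTH = 1024
SEGMENT_SAMPLES = 343980

# Segment duration in seconds.
segment_seconds = SEGMENT_SAMPLES / SAMPLE_RATE

# One-sided FFT bins, and the count kept after dropping one bin (assumption above).
one_sided_bins = N_FFT // 2 + 1
freq_bins = N_FFT // 2

# Frame count, assuming one frame per hop with the last partial hop padded.
frames = math.ceil(SEGMENT_SAMPLES / HOP_LENGTH)

print(segment_seconds, one_sided_bins, freq_bins, frames)
```

Under these assumptions the check reproduces the stated 7.8 s, 2048 bins, and 336 frames.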

Specs

Opset 18, 100% standard ONNX operators (no custom domains).
~166 MB, float32.
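For memory budgeting in the browser, the float32 input and output buffers can be sized from the shapes listed above. This Python sketch covers only the four I/O tensors; it says nothing about the intermediate activations the runtime allocates, and the names are illustrative.

```python
from math import prod
from typing import Dict, Tuple

BYTES_PER_FLOAT32 = 4

# Shapes copied from the Inputs and Outputs tables.
SHAPES: Dict[str, Tuple[int, ...]] = {
    "mix": (1, 2, 343980),
    "mag": (1, 4, 2048, 336),
    "freq": (1, 4, 4, 2048, 336),
    "time": (1, 4, 2, 343980),
}


def tensor_bytes(shape: Tuple[int, ...]) -> int:
    """Size in bytes of a dense float32 tensor with the given shape."""
    return prod(shape) * BYTES_PER_FLOAT32


sizes = {name: tensor_bytes(shape) for name, shape in SHAPES.items()}
total_io_bytes = sum(sizes.values())
```

The four buffers total 68,809,440 bytes (about 69 MB) per segment, of which `freq` alone accounts for 44,040,192 bytes.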

Citation

@inproceedings{rouard2022hybrid,
  title={Hybrid Transformers for Music Source Separation},
  author={Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
  booktitle={ICASSP 2023},
  year={2023}
}
