# Demucs v4 TRT

HTDemucs 6-stem, exported to a single-graph ONNX checkpoint and compiled to TensorRT for native Windows inference.

drums · bass · other · vocals · guitar · piano

~5 seconds per song on an RTX 3090. No Python required at runtime.
## What's in this repo
| File | Description |
|---|---|
| `demucsv4.onnx` | Canonical checkpoint: full-graph HTDemucs 6s export with STFT/ISTFT internalized. Source for all TRT engine builds. ~235 MB |
| `demucsv4_sm86_trt10.15.trt` | Pre-built TRT engine for RTX 30-series (sm86), compiled from the ONNX above under TensorRT 10.15 with FP16. ~157 MB |
The ONNX is the artifact you want if you're on Linux, on a non-Ampere GPU, or want to compile your own engine. The `.trt` is the artifact you want if you just want to run on an RTX 3090/3080/3070/3060 Ti right now.
## What is HTDemucs?
HTDemucs (Hybrid Transformer Demucs) is Meta AI's fourth-generation music source separation model, introduced in Hybrid Transformers for Music Source Separation (Rouard et al., ICASSP 2023).
Where earlier Demucs generations processed audio purely in the time domain, HTDemucs runs two parallel encoders simultaneously: one operating on the raw waveform, the other on the STFT spectrogram, with a cross-attention Transformer at the bottleneck connecting them. This lets the model correlate time-domain and frequency-domain features before decoding, yielding measurably better separation quality, especially on spectrally complex, temporally sparse instruments like piano and guitar.
The `htdemucs_6s` variant adds dedicated guitar and piano stems on top of the standard drums/bass/other/vocals quad, making it one of the most capable publicly available separation models for music production use.
## Why is this ONNX different?
Most HTDemucs ONNX exports externalize the FFT. Because ONNX does not natively support complex-valued tensors, the typical workaround (used by demucs.onnx and similar projects) is to run STFT in host code and pass the spectrogram as a second input alongside the waveform.
That works, but TensorRT only sees the truncated graph. FFT kernels live outside it and cannot be fused with the adjacent convolutions. Memory must cross the host-device boundary at every chunk boundary. The optimization surface available to TRT is smaller.
This export internalizes the STFT. A `WaveformOnlyWrapper` calls `model._spec()` inside the forward pass before handing off to the model proper. The exported graph contains the complete computation: both encoder branches, all cross-attention, STFT and ISTFT, start to finish.
TensorRT receives the full dataflow. FFT kernel chains get fused with surrounding convolutions. The frequency- and time-domain encoders are compiled together. The Transformer layers are optimized as a unit. Nothing crosses the host-device boundary mid-inference.
Combined with FP16 Tensor Core compilation on Ampere (328 Tensor Cores on a 3090), the difference between this approach and naive ONNX Runtime inference is approximately 25–30× in wall-clock throughput: from several minutes per track to ~5 seconds.
## Performance
Benchmarked with trtexec on an RTX 3090 (sm86, 24 GB VRAM), TensorRT 10.15.1, FP16.
Per-chunk latency (input `[1, 2, 343980]`, ~7.8 seconds of audio):
| Metric | Value |
|---|---|
| GPU compute (median) | 115.8 ms |
| GPU compute (mean) | 118.7 ms |
| GPU compute (p90) | 129.6 ms |
| Host latency (mean) | 120.2 ms |
| H2D transfer | ~0.23 ms |
| D2H transfer | ~1.26 ms |
| Throughput | 8.3 chunks/sec |
VRAM usage:
| Allocation | Size |
|---|---|
| Engine weights | 157 MiB |
| Execution context | 403 MiB |
| Total | ~560 MiB |
Wall-clock on a full song: a 3-minute track is ~23 chunks at ~119 ms each, roughly 2.7 seconds of pure GPU compute. The ~5 second total includes host-side normalization, chunking, overlap-add accumulation, and WAV file I/O.
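The arithmetic behind that estimate, as a quick sanity check (a back-of-the-envelope sketch that, like the estimate above, ignores chunk overlap):

```python
SAMPLE_RATE = 44_100
CHUNK = 343_980            # samples per inference chunk
MEAN_GPU_S = 0.1187        # mean GPU compute per chunk, from the table above

track_s = 180              # a 3-minute song
chunks = track_s * SAMPLE_RATE / CHUNK
gpu_s = chunks * MEAN_GPU_S
print(f"{chunks:.1f} chunks -> {gpu_s:.2f} s of GPU compute")
# 23.1 chunks -> 2.74 s of GPU compute
```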
Raw benchmark log: `benchmarks/trtexec_benchmark_sm86.txt`
```
demucsv4.onnx (this repo, canonical checkpoint)
 │
 ├─ build_engine.py (sm86) ──▶ demucsv4_sm86_trt10.15.trt (RTX 30-series, this repo)
 ├─ build_engine.py (sm89) ──▶ demucsv4_sm89_trt10.15.trt (RTX 40-series)
 ├─ build_engine.py (sm80) ──▶ demucsv4_sm80_trt10.15.trt (A100, A6000)
 ├─ build_engine.py (sm75) ──▶ demucsv4_sm75_trt10.15.trt (RTX 20-series)
 └─ build_engine.py (sm61) ──▶ demucsv4_sm61_trt10.15.trt (GTX 10-series, FP32)
```
## Quick Start (Windows, RTX 30-series)
Download the latest release from GitHub, extract, and run:

```powershell
.\Demucs_v4_TRT.exe "song.mp3"
```

Stems land in `.\stems\<song name>\` as WAV files.
## Other NVIDIA GPUs: build an engine from this ONNX
```powershell
# 1. Clone the repo and run first-time setup
git clone https://github.com/MansfieldPlumbing/Demucs_v4_TRT
cd Demucs_v4_TRT
.\launch.bat

# 2. From the setup menu: [2] Preflight, then [4] Python environment
# 3. Download demucsv4.onnx from this HuggingFace repo into models\

# 4. Build an engine for your GPU
& "$env:LOCALAPPDATA\micromamba\micromamba.exe" run -n demucs-trt python build_engine.py

# 5. Run
.\Demucs_v4_TRT.exe "song.mp3"
```
Pascal (GTX 10-series) users: pass `--fp32` to `build_engine.py`; there are no FP16 Tensor Cores on sm61.
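An equivalent stand-alone build can also be done with NVIDIA's `trtexec` tool. A sketch of assembling that command from Python (the file paths are placeholders; `--onnx`, `--saveEngine`, and `--fp16` are standard `trtexec` flags):

```python
def trtexec_args(onnx="models/demucsv4.onnx",
                 engine="models/demucsv4.trt", fp16=True):
    """Build a trtexec command that compiles the ONNX into a TRT engine."""
    args = ["trtexec", f"--onnx={onnx}", f"--saveEngine={engine}"]
    if fp16:
        args.append("--fp16")  # omit on Pascal (sm61): no FP16 Tensor Cores
    return args

# e.g. subprocess.run(trtexec_args(), check=True)
```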
## GPU Compatibility
TRT engines are architecture-specific. `build_engine.py` auto-detects your GPU and names the output correctly.
| Architecture | Cards | Notes |
|---|---|---|
| sm89 | RTX 4090, 4080, 4070 Ti | Build from ONNX with `build_engine.py` |
| sm86 | RTX 3090, 3080, 3070, 3060 Ti | ✅ Pre-built engine included in this repo |
| sm80 | A100, A6000 | Build from ONNX with `build_engine.py` |
| sm75 | RTX 2080, 2070, 2060, T4 | Build from ONNX with `build_engine.py` |
| sm70 | Tesla V100 | Build from ONNX with `build_engine.py` |
| sm61 | GTX 1080, 1070, 1060 | `build_engine.py --fp32` (no FP16 on Pascal) |
| < sm61 | GTX 900 and older | ❌ Not supported by TensorRT 10 |
## Usage
```text
Demucs_v4_TRT.exe <input> [options]

Options:
  -m <model.trt>   TRT engine path (auto-discovers *.trt in models\ if omitted)
  -o <dir>         Output directory (default: .\stems\<song name>)
  -s               Single-chunk debug mode (first chunk only)

Examples:
  .\Demucs_v4_TRT.exe "song.mp3"
  .\Demucs_v4_TRT.exe "song.mp3" -m models\demucsv4_sm89_trt10.15.trt
  .\Demucs_v4_TRT.exe "song.mp3" -o D:\my_stems
```
Supported input: anything Windows Media Foundation decodes, including MP3, WAV, FLAC, AAC, and M4A.
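For batch jobs, the CLI above can be driven from a short script. A sketch (the helper names are mine, and it assumes `Demucs_v4_TRT.exe` is on `PATH`):

```python
import subprocess
from pathlib import Path

def build_command(song, engine=None, out_dir=None):
    """Assemble a Demucs_v4_TRT.exe invocation from the options above."""
    cmd = ["Demucs_v4_TRT.exe", str(song)]
    if engine:
        cmd += ["-m", str(engine)]   # explicit TRT engine path
    if out_dir:
        cmd += ["-o", str(out_dir)]  # custom output directory
    return cmd

def separate_folder(folder):
    """Run the separator over every MP3 in a folder, one track at a time."""
    for song in sorted(Path(folder).glob("*.mp3")):
        subprocess.run(build_command(song), check=True)
```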
## Inference Pipeline
```
song.mp3
   │
   ▼  AudioFileReader + MediaFoundationResampler (NAudio)
   │      44100 Hz · stereo · float32
   │
   ▼  Whole-song normalization (mean/std of mono mix)
   │
   ▼  Chunked overlap-add inference
   │      chunk = 343,980 samples (~7.8 s)
   │      25% overlap · linear fade windowing
   │      [1, 2, 343980] → GPU → [1, 6, 2, 343980]
   │
   ▼  demucs_v4_trt.dll (C++ TRT bridge, P/Invoke from C#)
   │      cudaMemcpyAsync → enqueueV3 → cudaMemcpyAsync
   │
   ▼  Denormalize · overlap-add accumulate · save
   │
   ▼  stems/<song name>/
          drums.wav  bass.wav  other.wav  vocals.wav  guitar.wav  piano.wav
```
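The chunked overlap-add stage can be sketched in NumPy. This is a minimal illustration of the technique (linear crossfade windows over 25%-overlapping chunks), not the shipped C#/C++ implementation; `run_model` stands in for the TRT engine call:

```python
import numpy as np

CHUNK = 343_980          # samples per chunk (~7.8 s at 44.1 kHz)
RAMP = CHUNK // 4        # 25% overlap between consecutive chunks

def separate(signal, run_model, chunk=CHUNK, ramp=RAMP):
    """Chunked overlap-add with linear fade (crossfade) windows.

    signal:    (channels, samples) float array
    run_model: maps one (channels, chunk) block to (stems, channels, chunk)
    """
    hop = chunk - ramp
    ch, n = signal.shape
    up = np.arange(ramp) / ramp          # linear 0 -> 1 ramp
    out = None
    for start in range(0, n, hop):
        end = min(start + chunk, n)
        block = signal[:, start:end]
        if end - start < chunk:          # zero-pad the final block
            block = np.pad(block, ((0, 0), (0, chunk - (end - start))))
        stems = run_model(block)         # (stems, channels, chunk)
        if out is None:
            out = np.zeros((stems.shape[0], ch, n), dtype=signal.dtype)
        w = np.ones(chunk)
        if start > 0:                    # fade in over the region shared
            w[:ramp] = up                # with the previous chunk's tail
        if end < n:                      # fade out, unless this is the
            w[-ramp:] = 1.0 - up         # last chunk
        seg = end - start
        out[:, :, start:end] += stems[:, :, :seg] * w[:seg]
        if end == n:
            break
    return out
```

Because each fade-in ramp is the exact complement of the previous chunk's fade-out ramp, the windows sum to one everywhere and an identity "model" reconstructs the input exactly.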
## Reproducing the ONNX
The ONNX can be reproduced from scratch using `export_htdemucs.py`. No GPU is required for export; CPU is sufficient.
```powershell
mrun python export_htdemucs.py
# → models\demucsv4.onnx
```
Critical detail: `_spec()` must run inside the graph via `WaveformOnlyWrapper`. A two-input ONNX (waveform + pre-computed spectrogram as separate inputs) will not achieve kernel fusion: TensorRT cannot see the FFT operations and cannot fuse them with the surrounding convolutions. The graph must be single-input for the optimization to work.
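A minimal sketch of such a wrapper, assuming PyTorch and a stand-in `core` model (the real export wraps HTDemucs and calls the model's own `_spec()`; the STFT parameters here are illustrative, not the model's actual values):

```python
import torch

class WaveformOnlyWrapper(torch.nn.Module):
    """Fold the STFT into the exported graph: one waveform input, no
    host-side spectrogram computation."""

    def __init__(self, core, n_fft=4096, hop=1024):
        super().__init__()
        self.core = core                 # stands in for the HTDemucs model
        self.n_fft, self.hop = n_fft, hop

    def _spec(self, wav):
        b, c, t = wav.shape
        z = torch.stft(wav.reshape(b * c, t), self.n_fft, self.hop,
                       window=torch.hann_window(self.n_fft),
                       return_complex=True)
        # ONNX has no complex dtype, so split into real/imag components
        return torch.view_as_real(z).reshape(b, c, *z.shape[1:], 2)

    def forward(self, wav):              # single input: [B, C, T]
        return self.core(wav, self._spec(wav))

# Export sketch (dummy core that just echoes the waveform):
wrapper = WaveformOnlyWrapper(lambda wav, spec: wav + 0 * spec.sum())
dummy = torch.randn(1, 2, 44100)
# torch.onnx.export(wrapper, (dummy,), "demucsv4.onnx", opset_version=17)
```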
## Build Requirements (source builds only)
| Dependency | Version |
|---|---|
| NVIDIA Driver | ≥ 561.0 |
| CUDA Toolkit | ≥ 13.0 |
| TensorRT SDK | ≥ 10.0 |
| VS Build Tools 2022 | C++ workload |
| .NET SDK | ≥ 9.0 |
| PowerShell | ≥ 7.5 |
Python is not required at runtime. It is only used for engine building and ONNX export, managed via micromamba in a self-contained environment.
## Intended Use
Music source separation for production, remixing, stem extraction, and audio research. The output is 6 WAV stems per track: drums, bass, other, vocals, guitar, and piano.
The underlying weights are htdemucs_6s from Meta AI, released under a research license. TensorRT engines compiled from this ONNX are for personal and research use. Commercial use is subject to Meta's license for the underlying model weights.
## Citation

```bibtex
@inproceedings{rouard2023hybrid,
  title     = {Hybrid Transformers for Music Source Separation},
  author    = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
  booktitle = {ICASSP 2023},
  year      = {2023}
}
```