# Demucs v4 TRT

HTDemucs 6-stem, exported to a single-graph ONNX checkpoint and compiled to TensorRT for native Windows inference.

drums · bass · other · vocals · guitar · piano

~5 seconds per song on an RTX 3090. No Python required at runtime.


## What's in this repo

| File | Description | Size |
|---|---|---|
| `demucsv4.onnx` | Canonical checkpoint: full-graph HTDemucs 6s export with STFT/ISTFT internalized. Source for all TRT engine builds. | ~235 MB |
| `demucsv4_sm86_trt10.15.trt` | Pre-built TRT engine for RTX 30-series (sm86), compiled from the ONNX above under TensorRT 10.15 with FP16. | ~157 MB |

The ONNX is the artifact you want if you're on Linux, have a non-Ampere GPU, or want to compile your own engine. The `.trt` is the artifact you want if you just want to run on an RTX 3090/3080/3070/3060 Ti right now.


## What is HTDemucs?

HTDemucs (Hybrid Transformer Demucs) is Meta AI's fourth-generation music source separation model, introduced in Hybrid Transformers for Music Source Separation (Rouard et al., ICASSP 2023).

Where earlier Demucs generations processed audio purely in the time domain, HTDemucs runs two parallel encoders: one operating on the raw waveform, the other on the STFT spectrogram, joined at the bottleneck by a Transformer encoder with cross-attention. This lets the model correlate time-domain and frequency-domain features before decoding, yielding measurably better separation quality, especially on spectrally complex, temporally sparse instruments like piano and guitar.

The htdemucs_6s variant adds dedicated guitar and piano stems on top of the standard drums/bass/other/vocals quad, making it the most capable publicly available separation model for music production use.


## Why is this ONNX different?

Most HTDemucs ONNX exports externalize the FFT. Because ONNX does not natively support complex-valued tensors, the typical workaround (used by demucs.onnx and similar projects) is to run STFT in host code and pass the spectrogram as a second input alongside the waveform.

That works, but TensorRT only sees the truncated graph: the FFT kernels live outside it and cannot be fused with the adjacent convolutions, memory must cross the host-device boundary for every chunk, and the optimization surface available to TRT shrinks accordingly.

This export internalizes the STFT. A `WaveformOnlyWrapper` calls `model._spec()` inside the forward pass before handing off to the model proper. The exported graph contains the complete computation: both encoder branches, all cross-attention, STFT and ISTFT, start to finish.

TensorRT receives the full dataflow. FFT kernel chains get fused with surrounding convolutions. The frequency- and time-domain encoders are compiled together. The Transformer layers are optimized as a unit. Nothing crosses the host-device boundary mid-inference.

Combined with FP16 Tensor Core compilation on Ampere (328 Tensor Cores on a 3090), the difference between this approach and naive ONNX Runtime inference is approximately 25-30× in wall-clock throughput: from several minutes per track to ~5 seconds.

## Performance

Benchmarked with trtexec on an RTX 3090 (sm86, 24 GB VRAM), TensorRT 10.15.1, FP16.

Per-chunk latency (input: `[1, 2, 343980]`, ~7.8 seconds of audio):

| Metric | Value |
|---|---|
| GPU compute (median) | 115.8 ms |
| GPU compute (mean) | 118.7 ms |
| GPU compute (p90) | 129.6 ms |
| Host latency (mean) | 120.2 ms |
| H2D transfer | ~0.23 ms |
| D2H transfer | ~1.26 ms |
| Throughput | 8.3 chunks/sec |

VRAM usage:

| Allocation | Size |
|---|---|
| Engine weights | 157 MiB |
| Execution context | 403 MiB |
| Total | ~560 MiB |

Wall-clock on a full song: a 3-minute track is ~23 chunks at ~119 ms each, roughly 2.7 seconds of pure GPU compute. The ~5 second total includes host-side normalization, chunking, overlap-add accumulation, and WAV file I/O.
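
The per-track arithmetic can be checked directly from the table above (a quick Python sanity check using the mean per-chunk latency; the chunk count here ignores the 25% overlap, matching the figure quoted above):

```python
track_s = 180                      # a 3-minute track
chunk_s = 343_980 / 44_100         # 7.8 s of audio per chunk at 44.1 kHz
mean_ms = 118.7                    # mean GPU compute per chunk (table above)

chunks = round(track_s / chunk_s)  # chunk count, ignoring overlap
gpu_s = chunks * mean_ms / 1000    # pure GPU compute for the track
print(chunks, round(gpu_s, 1))     # → 23 2.7
```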

Raw benchmark log: `benchmarks/trtexec_benchmark_sm86.txt`

```
demucsv4.onnx                         ← this repo (canonical checkpoint)
    │
    ├─ build_engine.py (sm86) ──────→  demucsv4_sm86_trt10.15.trt   (RTX 30-series)  ← this repo
    ├─ build_engine.py (sm89) ──────→  demucsv4_sm89_trt10.15.trt   (RTX 40-series)
    ├─ build_engine.py (sm80) ──────→  demucsv4_sm80_trt10.15.trt   (A100, A6000)
    ├─ build_engine.py (sm75) ──────→  demucsv4_sm75_trt10.15.trt   (RTX 20-series)
    └─ build_engine.py (sm61) ──────→  demucsv4_sm61_trt10.15.trt   (GTX 10-series, FP32)
```

## Quick Start (Windows, RTX 30-series)

Download the latest release from GitHub, extract, and run:

```powershell
.\Demucs_v4_TRT.exe "song.mp3"
```

Stems land in `.\stems\<song name>\` as WAV files.

### Other NVIDIA GPUs: build an engine from this ONNX

```powershell
# 1. Clone the repo and run first-time setup
git clone https://github.com/MansfieldPlumbing/Demucs_v4_TRT
cd Demucs_v4_TRT
.\launch.bat

# 2. From the setup menu: [2] Preflight, then [4] Python environment

# 3. Download demucsv4.onnx from this HuggingFace repo into models\

# 4. Build an engine for your GPU
& "$env:LOCALAPPDATA\micromamba\micromamba.exe" run -n demucs-trt python build_engine.py

# 5. Run
.\Demucs_v4_TRT.exe "song.mp3"
```

Pascal (GTX 10-series) users: pass `--fp32` to `build_engine.py`; there are no FP16 Tensor Cores on sm61.
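
If you'd rather not use `build_engine.py`, a comparable engine can usually be built straight from the ONNX with `trtexec`, which ships with the TensorRT SDK. This is an illustrative invocation, not the repo's official path, and the `demucsv4_local.trt` output name is my own; the runtime auto-discovers any `*.trt` in `models\`:

```shell
# Build an FP16 engine for the GPU in this machine from the canonical ONNX.
# Omit --fp16 on Pascal (sm61), which has no FP16 Tensor Cores.
trtexec --onnx=models\demucsv4.onnx --fp16 --saveEngine=models\demucsv4_local.trt
```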


## GPU Compatibility

TRT engines are architecture-specific. build_engine.py auto-detects your GPU and names the output correctly.

| Architecture | Cards | Notes |
|---|---|---|
| sm89 | RTX 4090, 4080, 4070 Ti | Build from ONNX with `build_engine.py` |
| sm86 | RTX 3090, 3080, 3070, 3060 Ti | ✅ Pre-built engine included in this repo |
| sm80 | A100, A6000 | Build from ONNX with `build_engine.py` |
| sm75 | RTX 2080, 2070, 2060, T4 | Build from ONNX with `build_engine.py` |
| sm70 | Tesla V100 | Build from ONNX with `build_engine.py` |
| sm61 | GTX 1080, 1070, 1060 | `build_engine.py --fp32` (no FP16 on Pascal) |
| < sm61 | GTX 900 and older | ❌ Not supported by TensorRT 10 |

## Usage

```
Demucs_v4_TRT.exe <input> [options]

Options:
  -m <model.trt>    TRT engine path (auto-discovers *.trt in models\ if omitted)
  -o <dir>          Output directory  (default: .\stems\<song name>)
  -s                Single-chunk debug mode (first chunk only)

Examples:
  .\Demucs_v4_TRT.exe "song.mp3"
  .\Demucs_v4_TRT.exe "song.mp3" -m models\demucsv4_sm89_trt10.15.trt
  .\Demucs_v4_TRT.exe "song.mp3" -o D:\my_stems
```

Supported input: anything Windows Media Foundation decodes, including MP3, WAV, FLAC, AAC, and M4A.


## Inference Pipeline

```
song.mp3
  │
  ▼  AudioFileReader + MediaFoundationResampler  (NAudio)
  │  → 44100 Hz · stereo · float32
  │
  ▼  Whole-song normalization  (mean/std of mono mix)
  │
  ▼  Chunked overlap-add inference
  │  chunk = 343,980 samples (~7.8 s)
  │  25% overlap · linear fade windowing
  │  [1, 2, 343980] → GPU → [1, 6, 2, 343980]
  │
  ▼  demucs_v4_trt.dll  (C++ TRT bridge, P/Invoke from C#)
  │  cudaMemcpyAsync → enqueueV3 → cudaMemcpyAsync
  │
  ▼  Denormalize · overlap-add accumulate · save
  │
  ▼  stems/<song name>/
       drums.wav  bass.wav  other.wav  vocals.wav  guitar.wav  piano.wav
```
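
The chunked overlap-add stage is the subtle part of the pipeline. Here is a minimal numpy sketch of the idea; the real implementation lives in the C# host, so the function name, the edge clamp, and the exact window shape below are illustrative assumptions:

```python
import numpy as np

def overlap_add(x, infer, chunk=343_980, overlap=0.25):
    """Blend per-chunk model outputs with a linear-fade window.

    x:     [channels, samples] float32 waveform
    infer: maps a [channels, chunk] segment to [6, channels, chunk] stems
    """
    hop = int(chunk * (1 - overlap))
    C, T = x.shape
    out = np.zeros((6, C, T), dtype=np.float32)
    weight = np.zeros(T, dtype=np.float32)

    # Trapezoid window: linear fade-in/out over the overlapped region.
    fade = int(chunk * overlap)
    win = np.ones(chunk, dtype=np.float32)
    win[:fade] = np.linspace(0.0, 1.0, fade)
    win[-fade:] = np.linspace(1.0, 0.0, fade)
    win = np.maximum(win, 1e-3)  # keep the song edges nonzero

    for start in range(0, T, hop):
        end = min(start + chunk, T)
        seg = np.zeros((C, chunk), dtype=np.float32)  # zero-pad last chunk
        seg[:, : end - start] = x[:, start:end]
        stems = infer(seg)                            # [6, C, chunk]
        w = win[: end - start]
        out[:, :, start:end] += stems[:, :, : end - start] * w
        weight[start:end] += w

    return out / np.maximum(weight, 1e-8)             # normalize the blend
```

Dividing by the accumulated window weight means any windowed pass over the track reconstructs an identity model exactly, which is a handy property for testing the chunking logic in isolation.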

## Reproducing the ONNX

The ONNX can be reproduced from scratch with `export_htdemucs.py`; no GPU is required for export, CPU is sufficient.

```powershell
mrun python export_htdemucs.py
# → models\demucsv4.onnx
```

**Critical detail:** `_spec()` must run inside the graph via `WaveformOnlyWrapper`. A two-input ONNX (waveform + pre-computed spectrogram as separate inputs) will not achieve kernel fusion: TensorRT cannot see the FFT operations and cannot fuse them with the surrounding convolutions. The graph must be single-input for the optimization to work.
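
A PyTorch sketch of what such a wrapper looks like. The class name and `model._spec()` come from the description above; the exact hand-off signature into the model is an assumption here, so treat `export_htdemucs.py` as the authoritative version:

```python
import torch

class WaveformOnlyWrapper(torch.nn.Module):
    """Single-input export wrapper: compute the spectrogram in-graph."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, waveform):            # waveform: [B, 2, T]
        spec = self.model._spec(waveform)   # STFT now lives inside the graph
        # Hand both views to the hybrid model (two-arg call is assumed here).
        return self.model(waveform, spec)   # → [B, 6, 2, T] stems
```

The wrapped module then exports with a single input via the standard API, e.g. `torch.onnx.export(wrapped, dummy_wave, "demucsv4.onnx", input_names=["waveform"], opset_version=17)` (the input name and opset are illustrative).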


## Build Requirements (source builds only)

| Dependency | Version |
|---|---|
| NVIDIA Driver | ≥ 561.0 |
| CUDA Toolkit | ≥ 13.0 |
| TensorRT SDK | ≥ 10.0 |
| VS Build Tools | 2022, C++ workload |
| .NET SDK | ≥ 9.0 |
| PowerShell | ≥ 7.5 |

Python is not required at runtime. It is only used for engine building and ONNX export, managed via micromamba in a self-contained environment.


## Intended Use

Music source separation for production, remixing, stem extraction, and audio research. The output is 6 WAV stems per track: drums, bass, other, vocals, guitar, and piano.

The underlying weights are htdemucs_6s from Meta AI, released under a research license. TensorRT engines compiled from this ONNX are for personal and research use. Commercial use is subject to Meta's license for the underlying model weights.


## Citation

```bibtex
@inproceedings{rouard2023hybrid,
  title     = {Hybrid Transformers for Music Source Separation},
  author    = {Rouard, Simon and Massa, Francisco and D{\'e}fossez, Alexandre},
  booktitle = {ICASSP 2023},
  year      = {2023}
}
```
