---
license: mit
tags:
- audio
- music-source-separation
- sound-separation
- demucs
- htdemucs
- stem-separation
- inference
pipeline_tag: audio-to-audio
---

## Music Source Separation

These are the Demucs v4 models from Facebook Research.

---

## What is HTDemucs?

[HTDemucs (Hybrid Transformer Demucs)](https://github.com/facebookresearch/demucs) is Meta AI's fourth-generation music source separation model, introduced in [*Hybrid Transformers for Music Source Separation* (Rouard et al., ICASSP 2023)](https://arxiv.org/abs/2211.08553).

The earliest Demucs generations processed audio purely in the time domain, and Hybrid Demucs (v3) added a second branch that operates on the STFT spectrogram alongside the raw-waveform branch. HTDemucs keeps these **two parallel branches** and replaces the innermost layers with a **cross-domain Transformer encoder**, which applies self-attention within each domain and **cross-attention** between them. This lets the model correlate time-domain and frequency-domain features before decoding; the paper reports improved separation quality over Hybrid Demucs when additional training data is used.

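To make the cross-attention idea concrete, here is a minimal pure-Python sketch in which each branch queries the other. It is a toy under stated assumptions, not the HTDemucs implementation: there are no learned projections, only one head, keys double as values, and the token values and the names `cross_attention`, `time_tokens`, and `spec_tokens` are invented for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values):
    """Scaled dot-product attention where queries come from one branch
    and keys/values come from the other.

    Simplification (not the real model): no learned Q/K/V projections,
    a single head, and keys double as values.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append([
            sum(w * kv[i] for w, kv in zip(weights, keys_values))
            for i in range(dim)
        ])
    return attended

# Toy bottleneck features: 3 time-branch tokens and 2 spectral-branch tokens,
# each a 4-dimensional vector. The values are made up for illustration.
time_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
spec_tokens = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]

# Each branch queries the other, so information flows in both directions.
time_enriched = cross_attention(time_tokens, spec_tokens)
spec_enriched = cross_attention(spec_tokens, time_tokens)
```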
The `htdemucs_6s` variant adds guitar and piano stems to the standard four (drums, bass, other, vocals). The upstream repository describes it as experimental: guitar separation is reported as acceptable, while the piano stem shows noticeable bleeding and artifacts.

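As a rough usage sketch for the six-stem model, the helper below builds a `demucs` command line and predicts where each stem file should land. It assumes the upstream `demucs` package is installed, that the stem names match the list shown, and that the default output layout `<out_dir>/<model>/<track name>/<stem>.wav` applies; the function names `demucs_command` and `expected_stem_paths` and the file `song.mp3` are hypothetical.

```python
from pathlib import Path

# Assumed stem names for `htdemucs_6s`: the standard four plus guitar and piano.
SIX_STEMS = ["drums", "bass", "other", "vocals", "guitar", "piano"]

def demucs_command(track, out_dir="separated", model="htdemucs_6s"):
    """Build the argument list for the `demucs` command-line tool.

    `-n` selects the pretrained model and `-o` sets the output folder.
    """
    return ["demucs", "-n", model, "-o", str(out_dir), str(track)]

def expected_stem_paths(track, out_dir="separated", model="htdemucs_6s"):
    """Predict where each stem lands, assuming the default output layout
    <out_dir>/<model>/<track name>/<stem>.wav."""
    folder = Path(out_dir) / model / Path(track).stem
    return {stem: folder / f"{stem}.wav" for stem in SIX_STEMS}

cmd = demucs_command("song.mp3")
paths = expected_stem_paths("song.mp3")

# To actually run the separation (requires `pip install demucs`):
#   import subprocess
#   subprocess.run(cmd, check=True)
```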

---

From Facebook Research:

Demucs is based on a U-Net convolutional architecture inspired by Wave-U-Net and SING, with GLUs, a BiLSTM between the encoder and decoder, specific initialization of weights, and transposed convolutions in the decoder.

See [facebookresearch's repository](https://github.com/facebookresearch/demucs) for more information on Demucs.
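
For readers unfamiliar with the GLUs mentioned in the architecture description, the sketch below shows the gating operation on a flat list of channel activations: the first half of the channels carries values, the second half carries gates. It is a pure-Python illustration with made-up numbers, assuming the usual GLU definition, and is not code from the Demucs repository.

```python
import math

def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def glu(channels):
    """Gated linear unit over a flat list of channel activations.

    The first half carries the values, the second half the gates:
    output[i] = values[i] * sigmoid(gates[i]).
    The output therefore has half as many channels as the input.
    """
    if len(channels) % 2:
        raise ValueError("GLU needs an even number of channels")
    half = len(channels) // 2
    values, gates = channels[:half], channels[half:]
    return [v * sigmoid(g) for v, g in zip(values, gates)]

# Four input channels: values [2.0, -1.0], gates [0.0, 10.0].
activations = [2.0, -1.0, 0.0, 10.0]
gated = glu(activations)  # a gate of 0 halves its value; a large gate passes it almost unchanged
```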