Vocal Separation Core (Mel-Band RoFormer): ONNX / fp16 / WebGPU
ONNX export of a Mel-Band RoFormer vocal source-separation core, packaged for the musetric packages/ai runtime (Node + onnxruntime) and tuned for the WebGPU execution provider.
The graph is the neural network only: it takes a precomputed STFT representation and returns per-stem masks. STFT, iSTFT, chunking and complex packing run host-side. This is not a drop-in PyTorch checkpoint.
Intended uses & limitations
Intended:
- Vocals / instrumental separation as the first stage of an audio pipeline.
- Edge/client inference via WebGPU through onnxruntime (CPU and DML also work).
Out of scope:
- Standalone use without a host that computes the STFT input and applies masks + iSTFT (see musetric packages/ai).
- Use in other training frameworks: this is an inference-only export.
Limitations:
- Static time window T = 501 (~5 s). It is the conservative 6 GB-GPU profile; longer context needs a different export.
- Training-data provenance of the upstream weights is undocumented.
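The static window in the first limitation fixes how much audio each run covers. A minimal sketch of that arithmetic, assuming a centered STFT (frames = 1 + floor(samples / hop)) with the hop and sample rate quoted in the usage snippet. `planChunks` and its back-to-back, zero-padded policy are hypothetical; the actual host in musetric packages/ai may overlap and cross-fade chunks instead.

```typescript
// Assumption: centered STFT, so frames = 1 + floor(samples / hop).
const SAMPLE_RATE = 44100;
const HOP = 441;
const FRAMES = 501; // static T of this export

// Samples and seconds covered by one T-frame window under that assumption.
const samplesPerChunk = (FRAMES - 1) * HOP;
const secondsPerChunk = samplesPerChunk / SAMPLE_RATE;

interface Chunk {
  start: number;   // first sample of the chunk in the track
  length: number;  // real samples in the chunk
  padding: number; // zero samples needed to fill the window
}

// Hypothetical helper: split a track into back-to-back windows,
// zero-padding the final one up to the fixed window size.
function planChunks(totalSamples: number): Chunk[] {
  const chunks: Chunk[] = [];
  for (let start = 0; start < totalSamples; start += samplesPerChunk) {
    const length = Math.min(samplesPerChunk, totalSamples - start);
    chunks.push({ start, length, padding: samplesPerChunk - length });
  }
  return chunks;
}
```

Under these assumptions one window spans 220,500 samples, which is where the "~5 s" figure comes from.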
How to use
The session runs the core; the host supplies stft_repr and consumes masks.
import * as ort from 'onnxruntime-node';
// .onnx and .onnx.data must sit in the same directory; .data loads automatically.
const session = await ort.InferenceSession.create('syhft_core_fp16_t501.onnx', {
executionProviders: ['webgpu'], // or 'cpu' / 'dml'
});
// stftRepr: Float32Array of shape [1, 2050, 501, 2], produced host-side from one
// ~5 s audio chunk (n_fft=2048, hop=441, 44.1 kHz, stereo).
const input = new ort.Tensor('float32', stftRepr, [1, 2050, 501, 2]);
const { masks } = await session.run({ stft_repr: input });
// masks: float32 [1, 1, 3958, 501, 2] -> apply to STFT, then iSTFT host-side.
See the musetric packages/ai host code for the full STFT/iSTFT and
chunk-recombination pipeline.
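For orientation, the complex packing step could look roughly like the sketch below. It assumes the row order used by the lucidrains reference implementation (frequency-major with stereo channels interleaved, i.e. row = bin * 2 + channel); that ordering is an assumption here and should be checked against the host code. `ChannelSpectrum` and `packStftRepr` are hypothetical names.

```typescript
const BINS = 1025;  // n_fft / 2 + 1 for n_fft = 2048
const CHANNELS = 2; // stereo
const FRAMES = 501; // static T of this export

// Hypothetical host-side type: one channel's STFT as planar real/imag arrays,
// each of length BINS * FRAMES, stored bin-major (index = bin * FRAMES + frame).
interface ChannelSpectrum {
  re: Float32Array;
  im: Float32Array;
}

// Pack stereo spectra into the flat buffer for shape [1, BINS * CHANNELS, FRAMES, 2].
// Assumed row order: row = bin * CHANNELS + channel.
function packStftRepr(channels: ChannelSpectrum[]): Float32Array {
  if (channels.length !== CHANNELS) {
    throw new Error(`expected ${CHANNELS} channels, got ${channels.length}`);
  }
  const out = new Float32Array(BINS * CHANNELS * FRAMES * 2);
  for (let f = 0; f < BINS; f++) {
    for (let s = 0; s < CHANNELS; s++) {
      const { re, im } = channels[s];
      const row = f * CHANNELS + s;
      for (let t = 0; t < FRAMES; t++) {
        const src = f * FRAMES + t;
        const dst = (row * FRAMES + t) * 2;
        out[dst] = re[src];     // real part
        out[dst + 1] = im[src]; // imaginary part
      }
    }
  }
  return out;
}
```

The result has 1 × 2050 × 501 × 2 = 2,054,100 elements and can be passed straight to `new ort.Tensor('float32', packed, [1, 2050, 501, 2])`.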
Variant & files
- Precision: fp16 weights/activations, fp32 graph I/O (keep_io_types).
- WebGPU hardening: matmul attention; wide Concat/Split rewritten into ≤15-wide trees; fp16 attention softmax. Compat/perf only: values preserved.
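The Concat/Split rewrite listed above can be pictured as regrouping one wide node into a tree of narrow ones. The sketch below shows only the grouping arithmetic, not the export tooling itself (which this card does not include). `planConcatTree` is a hypothetical name, and the default width of 15 is taken from the bullet above.

```typescript
// Returns, for each tree level, how many inputs every Concat node at that level takes.
// Level 0 consumes the original inputs; each later level consumes the previous outputs.
function planConcatTree(inputCount: number, maxWidth = 15): number[][] {
  if (inputCount < 1 || maxWidth < 2) throw new Error("invalid arguments");
  const levels: number[][] = [];
  let remaining = inputCount;
  while (remaining > 1) {
    const level: number[] = [];
    for (let i = 0; i < remaining; i += maxWidth) {
      // A group of width 1 is a pass-through, not a real Concat.
      level.push(Math.min(maxWidth, remaining - i));
    }
    levels.push(level);
    remaining = level.length;
  }
  return levels;
}
```

For example, `planConcatTree(60)` gives `[[15, 15, 15, 15], [4]]`: four 15-wide nodes feeding one 4-wide node. Concatenation along a single axis is associative, so regrouping of this kind leaves the values unchanged.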
| File | Size | SHA256 |
|---|---|---|
| syhft_core_fp16_t501.onnx | 7,103,554 B | 4e6d3df35bca530893ea2a55bd1d7a78bc3721efbd51c8d3ed10eb3a19fa6d79 |
| syhft_core_fp16_t501.onnx.data | 741,145,440 B | 1bbc7fed448872976b28710d03d9ec8b41f513dab9a3a9f0ff6493c8b5e5e22d |
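The published hashes can be checked before creating a session. This is a small sketch using Node's built-in crypto module; the helper names are illustrative and not part of the musetric runtime.

```typescript
import { createHash } from "node:crypto";
import { createReadStream } from "node:fs";

// Hash an in-memory buffer; returns lowercase hex.
function sha256Hex(data: Uint8Array): string {
  return createHash("sha256").update(data).digest("hex");
}

// Hash a file by streaming it, so the ~741 MB .data file is never held in memory at once.
async function sha256File(path: string): Promise<string> {
  const hash = createHash("sha256");
  for await (const chunk of createReadStream(path)) {
    hash.update(chunk);
  }
  return hash.digest("hex");
}

// Usage (inside an async context):
//   const actual = await sha256File("syhft_core_fp16_t501.onnx");
//   if (actual !== "4e6d3df35bca530893ea2a55bd1d7a78bc3721efbd51c8d3ed10eb3a19fa6d79") {
//     throw new Error("checksum mismatch");
//   }
```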
Signature (opset ai.onnx 18, IR 10):
| Tensor | Type | Shape | Meaning |
|---|---|---|---|
| stft_repr (in) | float32 | [1, 2050, 501, 2] | batch, freq*2, time, complex |
| masks (out) | float32 | [1, 1, 3958, 501, 2] | per-stem masks |
Validation
Two separate measurements.
Conversion fidelity: this fp16/WebGPU export vs the same model on Python ONNX CUDA, same T=501, same host DSP (isolates conversion + EP error):
| Stem | SNR | corr |
|---|---|---|
| vocals | 44.73 dB | 0.999983 |
| instrumental | 56.17 dB | 0.999999 |
NaN count: 0; silent gaps: 0. The export is numerically near-lossless.
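The card does not spell out how SNR and corr are computed. The sketch below assumes the common definitions (SNR as reference energy over error energy in dB, corr as Pearson correlation over the waveform samples); the function names are illustrative.

```typescript
// SNR in dB: energy of the reference over energy of (reference - estimate).
// Returns Infinity when the two signals are identical.
function snrDb(reference: Float32Array, estimate: Float32Array): number {
  if (reference.length !== estimate.length) throw new Error("length mismatch");
  let signal = 0;
  let noise = 0;
  for (let i = 0; i < reference.length; i++) {
    const d = reference[i] - estimate[i];
    signal += reference[i] * reference[i];
    noise += d * d;
  }
  return 10 * Math.log10(signal / noise);
}

// Pearson correlation coefficient between two equal-length signals.
function pearson(a: Float32Array, b: Float32Array): number {
  if (a.length !== b.length) throw new Error("length mismatch");
  const n = a.length;
  let meanA = 0;
  let meanB = 0;
  for (let i = 0; i < n; i++) {
    meanA += a[i];
    meanB += b[i];
  }
  meanA /= n;
  meanB /= n;
  let cov = 0;
  let varA = 0;
  let varB = 0;
  for (let i = 0; i < n; i++) {
    const da = a[i] - meanA;
    const db = b[i] - meanB;
    cov += da * db;
    varA += da * da;
    varB += db * db;
  }
  return cov / Math.sqrt(varA * varB);
}
```

Here `reference` would be the stem from the comparison runtime and `estimate` the stem from this export, both rendered through the same host DSP.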
Quality vs the full-context torch reference: this T=501 export vs the original torch model at T=1101 (~11 s context):
| Stem | SNR | corr |
|---|---|---|
| vocals | 21.68 dB | 0.9966 |
| instrumental | 33.11 dB | 0.99976 |
This gap is the cost of the 5 s context window, not a conversion defect (conversion alone measures ~45–56 dB in the conversion-fidelity table). T=501 is the conservative 6 GB-GPU profile; a larger-T export narrows the gap on GPUs that can hold it.
Source & lineage
Code license and weight license are separate; ONNX conversion does not change the weight license. Documented only as far as it is verifiable.
- Architecture: Mel-Band RoFormer (arXiv:2310.01809).
- Reference implementation: lucidrains/BS-RoFormer.
- Training framework / config: ZFTurbo Music-Source-Separation-Training.
- Direct weight source: SYH99999/MelBandRoformerBigSYHFTV1Fast@96f4ae8e3f690e51ef26b3bef84531c944f5341b, MIT.
The base checkpoint the upstream fine-tuned from is not documented upstream; we do not assert a chain we cannot verify. This export preserves the upstream MIT license; we do not claim authorship of the original weights.
License & citation
MIT, inherited from the upstream weights.
@article{wang2023melbandroformer,
title={Mel-Band RoFormer for Music Source Separation},
author={Wang, Ju-Chiang and Lu, Wei-Tsung and Won, Minz},
journal={arXiv preprint arXiv:2310.01809},
year={2023}
}