Vocal Separation Core (Mel-Band RoFormer): ONNX / fp16 / WebGPU

ONNX export of a Mel-Band RoFormer vocal source-separation core, packaged for the musetric packages/ai runtime (Node + onnxruntime), tuned for the WebGPU execution provider.

The graph is the neural network only: it takes a precomputed STFT representation and returns per-stem masks. STFT, iSTFT, chunking and complex packing run host-side. This is not a drop-in PyTorch checkpoint.

Intended uses & limitations

Intended:

  • Vocals / instrumental separation as the first stage of an audio pipeline.
  • Edge/client inference via WebGPU through onnxruntime (CPU and DML also work).

Out of scope:

  • Standalone use without a host that computes the STFT input and applies masks + iSTFT (see musetric packages/ai).
  • Use in other training frameworks: this is an inference-only export.

Limitations:

  • Static time window T = 501 (~5 s). It is the conservative 6 GB-GPU profile; longer context needs a different export.
  • Training-data provenance of the upstream weights is undocumented.
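
The ~5 s figure follows from the STFT settings quoted under How to use (hop=441 at 44.1 kHz). A minimal sketch of that arithmetic, assuming a centered STFT that emits one frame per hop plus one; the function names are illustrative and not part of the musetric host code:

```javascript
// STFT settings as quoted in this card's usage comment.
const SAMPLE_RATE = 44100;
const HOP = 441;

// Frames produced for a chunk of audio.
// Assumption: centered STFT, i.e. floor(samples / hop) + 1 frames.
function framesForSamples(numSamples, hop = HOP) {
  return Math.floor(numSamples / hop) + 1;
}

// Audio samples covered by a window of `frames` frames (same assumption).
function samplesForFrames(frames, hop = HOP) {
  return (frames - 1) * hop;
}

const chunkSamples = samplesForFrames(501);      // 500 hops of 441 samples
const chunkSeconds = chunkSamples / SAMPLE_RATE; // duration of one chunk
console.log(chunkSamples, chunkSeconds, framesForSamples(chunkSamples));
```

Under the same assumption, the T=1101 reference window covers 1100 hops, which is where the ~11 s context in the validation section comes from.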

How to use

The session runs the core; the host supplies stft_repr and consumes masks.

import * as ort from 'onnxruntime-node';

// .onnx and .onnx.data must sit in the same directory; .data loads automatically.
const session = await ort.InferenceSession.create('syhft_core_fp16_t501.onnx', {
  executionProviders: ['webgpu'], // or 'cpu' / 'dml'
});

// stftRepr: Float32Array of shape [1, 2050, 501, 2], produced host-side from one
// ~5 s audio chunk (n_fft=2048, hop=441, 44.1 kHz, stereo).
const input = new ort.Tensor('float32', stftRepr, [1, 2050, 501, 2]);
const { masks } = await session.run({ stft_repr: input });
// masks: float32 [1, 1, 3958, 501, 2] -> apply to STFT, then iSTFT host-side.

See the musetric packages/ai host code for the full STFT/iSTFT and chunk-recombination pipeline.
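
Because the graph has static shapes, it can help to check the buffer size before building the tensor. The guard below is a sketch only: the helper names are hypothetical, and the shapes are copied from the signature table in this card.

```javascript
// Static shapes from this card's signature table.
const STFT_SHAPE = [1, 2050, 501, 2];
const MASKS_SHAPE = [1, 1, 3958, 501, 2];

// Number of elements a dense tensor of the given shape holds.
function elementCount(shape) {
  return shape.reduce((count, dim) => count * dim, 1);
}

// Hypothetical guard: throws if the host-side buffer cannot back stft_repr.
function checkStftRepr(stftRepr) {
  if (!(stftRepr instanceof Float32Array)) {
    throw new TypeError('stft_repr must be a Float32Array (graph I/O is fp32)');
  }
  const expected = elementCount(STFT_SHAPE);
  if (stftRepr.length !== expected) {
    throw new RangeError(`stft_repr has ${stftRepr.length} elements, expected ${expected}`);
  }
  return stftRepr;
}

console.log(elementCount(STFT_SHAPE), elementCount(MASKS_SHAPE));
```

Calling `checkStftRepr(stftRepr)` just before `new ort.Tensor(...)` turns a chunking or padding mistake into an explicit error on the host side.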

Variant & files

  • Precision: fp16 weights/activations, fp32 graph I/O (keep_io_types).
  • WebGPU hardening: matmul attention; wide Concat/Split rewritten into ≤15-wide trees; fp16 attention softmax. Compat/perf only; values preserved.

File                            Size           SHA256
syhft_core_fp16_t501.onnx       7,103,554 B    4e6d3df35bca530893ea2a55bd1d7a78bc3721efbd51c8d3ed10eb3a19fa6d79
syhft_core_fp16_t501.onnx.data  741,145,440 B  1bbc7fed448872976b28710d03d9ec8b41f513dab9a3a9f0ff6493c8b5e5e22d
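
The digests above can be checked after download. The following is one possible sketch using Node's built-in `crypto` and `fs` modules; `sha256OfFile` and `verifyModelFiles` are hypothetical helper names, and the file is streamed so that the ~741 MB weights file is never held in memory at once.

```javascript
// Expected digests, copied from the file table in this card.
const EXPECTED_SHA256 = {
  'syhft_core_fp16_t501.onnx':
    '4e6d3df35bca530893ea2a55bd1d7a78bc3721efbd51c8d3ed10eb3a19fa6d79',
  'syhft_core_fp16_t501.onnx.data':
    '1bbc7fed448872976b28710d03d9ec8b41f513dab9a3a9f0ff6493c8b5e5e22d',
};

// Streams a file through SHA-256 and resolves to the hex digest.
// Dynamic import() keeps the snippet usable from both ESM and CommonJS.
async function sha256OfFile(filePath) {
  const { createHash } = await import('node:crypto');
  const { createReadStream } = await import('node:fs');
  const hash = createHash('sha256');
  for await (const chunk of createReadStream(filePath)) {
    hash.update(chunk);
  }
  return hash.digest('hex');
}

// Throws on the first file in `dir` whose digest differs from the table.
async function verifyModelFiles(dir) {
  const { join } = await import('node:path');
  for (const [name, expected] of Object.entries(EXPECTED_SHA256)) {
    const actual = await sha256OfFile(join(dir, name));
    if (actual !== expected) {
      throw new Error(`SHA256 mismatch for ${name}: got ${actual}`);
    }
  }
}
```

A host could call `await verifyModelFiles(modelDir)` once before `ort.InferenceSession.create(...)`; `modelDir` here stands for wherever the two files were placed.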

Signature (opset ai.onnx 18, IR 10):

Tensor          Type     Shape                 Meaning
stft_repr (in)  float32  [1, 2050, 501, 2]     batch, freq*2, time, complex
masks (out)     float32  [1, 1, 3958, 501, 2]  per-stem masks

Validation

Two separate measurements.

Conversion fidelity: this fp16/WebGPU export vs the same model on Python ONNX CUDA, same T=501, same host DSP (isolates conversion + EP error):

Stem          SNR       corr
vocals        44.73 dB  0.999983
instrumental  56.17 dB  0.999999

NaN 0, silent gaps 0: the export is numerically near-lossless.
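
This card does not state how SNR and corr were computed. One common reading, assumed here, is SNR as the ratio of reference energy to error energy in dB, and corr as the Pearson correlation between the reference and test waveforms. A sketch under that assumption (function names are illustrative):

```javascript
// SNR in dB: 10 * log10(sum(ref^2) / sum((ref - est)^2)).
// Returns Infinity when the two signals are identical.
function snrDb(reference, estimate) {
  let signal = 0;
  let noise = 0;
  for (let i = 0; i < reference.length; i++) {
    const err = reference[i] - estimate[i];
    signal += reference[i] * reference[i];
    noise += err * err;
  }
  return 10 * Math.log10(signal / noise);
}

// Pearson correlation coefficient between two equal-length signals.
function pearson(a, b) {
  const n = a.length;
  let meanA = 0;
  let meanB = 0;
  for (let i = 0; i < n; i++) {
    meanA += a[i];
    meanB += b[i];
  }
  meanA /= n;
  meanB /= n;
  let cov = 0;
  let varA = 0;
  let varB = 0;
  for (let i = 0; i < n; i++) {
    const da = a[i] - meanA;
    const db = b[i] - meanB;
    cov += da * db;
    varA += da * da;
    varB += db * db;
  }
  return cov / Math.sqrt(varA * varB);
}
```

Both functions accept plain arrays or `Float32Array` stems of equal length; comparing the reference stem against the export's stem per channel would give numbers in the same form as the tables in this section.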

Quality vs the full-context torch reference: this T=501 export vs the original torch model at T=1101 (~11 s context):

Stem          SNR       corr
vocals        21.68 dB  0.9966
instrumental  33.11 dB  0.99976

This gap is the cost of the 5 s context window, not a conversion defect: conversion alone measures about 45 to 56 dB SNR (first table above). T=501 is the conservative 6 GB-GPU profile; a larger-T export narrows the gap on GPUs that can hold it.

Source & lineage

Code license and weight license are separate; ONNX conversion does not change the weight license. Lineage is documented here only as far as it can be verified.

The base checkpoint from which the upstream weights were fine-tuned is not documented upstream, and we do not assert a chain we cannot verify. This export preserves the upstream MIT license; we do not claim authorship of the original weights.

License & citation

MIT, inherited from the upstream weights.

@article{wang2023melbandroformer,
  title={Mel-Band RoFormer for Music Source Separation},
  author={Wang, Ju-Chiang and Lu, Wei-Tsung and Won, Minz},
  journal={arXiv preprint arXiv:2310.01809},
  year={2023}
}