Vocal Separation Core (Mel-Band RoFormer) — ONNX / fp16 / WebGPU

ONNX export of a Mel-Band RoFormer vocal source-separation core, packaged for the musetric packages/ai runtime (onnxruntime-web on WebGPU).

The graph is the neural network core: it takes a precomputed STFT representation and returns per-bin complex masks. The mel-band gather/average tables are baked into the ONNX graph, so the host does not need sidecar /tables/* assets. STFT, iSTFT, chunking and complex packing run host-side (browser WGSL + FFT). This is not a drop-in PyTorch checkpoint.

Intended uses & limitations

Intended:

Vocals / instrumental separation as the first stage of an audio pipeline.
Client/edge inference via WebGPU through onnxruntime-web.

Out of scope:

Standalone use without a host that computes the STFT input, applies the per-bin masks, and runs iSTFT (see musetric packages/ai).
Use in other training frameworks — this is an inference-only export.

Limitations:

Static time window T = 1101 STFT frames (~11 s at hop 441 and 44.1 kHz), which is the model's full reference context.
The graph uses com.microsoft fused ops (FastGelu, MultiHeadAttention, RMSNormalization) and is tuned for the WebGPU execution provider.
Training-data provenance of the upstream weights is undocumented.

How to use

The session runs the core; the host supplies stft_repr and consumes masks.

import * as ort from 'onnxruntime-web/webgpu';

// .onnx and .onnx.data must sit in the same directory; .data loads automatically.
const session = await ort.InferenceSession.create(
  'syhft_core_folded_fp16_webgpu.onnx',
  { executionProviders: ['webgpu'] },
);

// stftRepr: Float32Array of shape [1, 2050, 1101, 2], produced host-side from one
// ~11 s audio chunk (n_fft=2048, hop=441, 44.1 kHz, stereo).
const input = new ort.Tensor('float32', stftRepr, [1, 2050, 1101, 2]);
const { masks } = await session.run({ stft_repr: input });
// masks: float32 [1, 2050, 1101, 2] -> apply to STFT, then iSTFT host-side.

See the musetric packages/ai host code for the full STFT/iSTFT and chunk-recombination pipeline.
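The "apply to STFT" step mentioned in the snippet above can be sketched as an element-wise complex multiplication. This is a minimal sketch, not the musetric host code: it assumes both buffers share the flat [1, 2050, 1101, 2] layout with the last axis holding a (real, imaginary) pair, and the function name applyComplexMasks is hypothetical.

```javascript
// Multiply each STFT bin by its complex mask: (a + bi)(c + di).
// Assumption: stftRepr and masks are flat Float32Arrays with the same layout,
// where every consecutive pair of floats is (real, imaginary).
function applyComplexMasks(stftRepr, masks) {
  if (stftRepr.length !== masks.length || stftRepr.length % 2 !== 0) {
    throw new RangeError('stftRepr and masks must have the same even length');
  }
  const out = new Float32Array(stftRepr.length);
  for (let i = 0; i < stftRepr.length; i += 2) {
    const sr = stftRepr[i];
    const si = stftRepr[i + 1];
    const mr = masks[i];
    const mi = masks[i + 1];
    out[i] = sr * mr - si * mi;     // real part
    out[i + 1] = sr * mi + si * mr; // imaginary part
  }
  return out;
}
```

The result is the masked STFT for the separated stem, which the host would then pass to its own iSTFT.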

Variant & files

Precision: fp16 weights, fp32 graph I/O, fp32 RMSNorm islands.
Mel-band gather/average tables embedded into the graph output tail.
WebGPU hardening: FastGelu, MultiHeadAttention, RMSNormalization, and wide Concat/Split rewritten into <=8-wide trees so every shader stays at <=9 storage buffers — under the strictest shipping cap (Dawn/Metal on macOS reports maxStorageBuffersPerShaderStage = 10). Compat/perf only; values preserved apart from fp16 conversion.
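
A host may want to confirm the storage-buffer budget described above before creating a session. The sketch below reads the standard WebGPU limit maxStorageBuffersPerShaderStage from the adapter; the helper names and the constant are hypothetical, and it only inspects what the adapter advertises, not what the runtime ends up requesting for its device.

```javascript
// Storage buffers per shader stage that the hardened graph stays within,
// as stated in this card.
const REQUIRED_STORAGE_BUFFERS = 9;

// Pure check over a GPUSupportedLimits-like object, so it can be unit-tested
// without a GPU.
function supportsHardenedGraph(limits) {
  return limits.maxStorageBuffersPerShaderStage >= REQUIRED_STORAGE_BUFFERS;
}

// Browser-side helper: resolves to false when WebGPU or an adapter is missing.
async function checkWebGpuAdapter() {
  if (typeof navigator === 'undefined' || !navigator.gpu) return false;
  const adapter = await navigator.gpu.requestAdapter();
  return adapter !== null && supportsHardenedGraph(adapter.limits);
}
```

Calling checkWebGpuAdapter() before InferenceSession.create lets the host show a clear message instead of failing during shader compilation.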

File	Size	SHA256
`syhft_core_folded_fp16_webgpu.onnx`	5,312,568 B	`e22f33a2895f8cc244e28494197a7c77d7a65101d0aa00dbafc626ed16a0cbdb`
`syhft_core_folded_fp16_webgpu.onnx.data`	741,190,540 B	`b08cfc80905e3560a4dd5d30f641299a47dd96d309ebbe9524d9d6c9d2a0356f`

Signature — opset ai.onnx 23 + com.microsoft 1 (IR 10):

Tensor	Type	Shape	Meaning
`stft_repr` (in)	float32	`[1, 2050, 1101, 2]`	batch, freq*2 (1025 bins x 2 stereo channels), time frames, complex (real/imag pair)
`masks` (out)	float32	`[1, 2050, 1101, 2]`	per-bin complex masks, already gathered/averaged from mel bands
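
The fixed shape in the signature follows from the STFT parameters quoted in the usage section (n_fft=2048, hop=441, 44.1 kHz, stereo). The arithmetic can be sketched as below; the centered-STFT frame count (frames = 1 + samples / hop) is an assumption about the host, and assertStftLength is a hypothetical guard, not part of any library.

```javascript
// STFT parameters quoted in this card.
const N_FFT = 2048;        // FFT size
const HOP = 441;           // hop length in samples
const SAMPLE_RATE = 44100; // Hz
const CHANNELS = 2;        // stereo
const FRAMES = 1101;       // static time window T

// One-sided spectrum: n_fft / 2 + 1 bins per channel, channels folded into freq.
const binsPerChannel = N_FFT / 2 + 1;       // 1025
const freqAxis = binsPerChannel * CHANNELS; // 2050

// Assumption: centered STFT, so frames = 1 + samples / hop.
const chunkSamples = (FRAMES - 1) * HOP;         // 485100
const chunkSeconds = chunkSamples / SAMPLE_RATE; // 11

// Total float32 elements in stft_repr (and in masks).
const ioShape = [1, freqAxis, FRAMES, 2];
const elementCount = ioShape.reduce((a, b) => a * b, 1); // 4514100

// Hypothetical guard to run before building the input tensor.
function assertStftLength(stftRepr) {
  if (stftRepr.length !== elementCount) {
    throw new RangeError(`expected ${elementCount} floats, got ${stftRepr.length}`);
  }
}
```

At 4 bytes per float32 element, each of the input and output tensors occupies about 18 MB.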

Validation

Parity of this fp16/WebGPU export against the PyTorch first-stage reference at the same T = 1101, which isolates conversion and execution-provider error:

Metric	Value
SNR (vocals)	~46–49 dB
correlation	~0.999
NaN / silent gaps	0

T = 1101 is the model's full reference context, so there is no context-window penalty — the gap is conversion + EP error only, which is numerically small. Re-run the parity gate on the exact published bytes before relying on it.
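
For readers re-running the parity check, the metrics in the table can be computed along these lines. This is a sketch using the common textbook definitions (SNR as the energy ratio of the reference to the residual, Pearson correlation); the card does not state the exact formulas used, so treat the definitions and the function names as assumptions.

```javascript
// SNR in dB: 10 * log10(sum(ref^2) / sum((ref - est)^2)).
// Returns Infinity when the two signals are identical.
function snrDb(ref, est) {
  if (ref.length !== est.length) throw new RangeError('length mismatch');
  let signal = 0;
  let noise = 0;
  for (let i = 0; i < ref.length; i++) {
    const d = ref[i] - est[i];
    signal += ref[i] * ref[i];
    noise += d * d;
  }
  return 10 * Math.log10(signal / noise);
}

// Pearson correlation coefficient between two equal-length signals.
function correlation(a, b) {
  if (a.length !== b.length) throw new RangeError('length mismatch');
  const n = a.length;
  let meanA = 0;
  let meanB = 0;
  for (let i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
  meanA /= n;
  meanB /= n;
  let cov = 0;
  let varA = 0;
  let varB = 0;
  for (let i = 0; i < n; i++) {
    const da = a[i] - meanA;
    const db = b[i] - meanB;
    cov += da * db;
    varA += da * da;
    varB += db * db;
  }
  return cov / Math.sqrt(varA * varB);
}

// Number of NaN samples in an output buffer.
function countNaN(x) {
  let count = 0;
  for (let i = 0; i < x.length; i++) if (Number.isNaN(x[i])) count++;
  return count;
}
```

Here ref would be the PyTorch reference vocals and est the vocals reconstructed from this export, both rendered from the same input chunk.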

Source & lineage

Code license and weight license are separate; ONNX conversion does not change the weight license. The lineage below is documented only as far as it can be verified.

Architecture: Mel-Band RoFormer (arXiv:2310.01809).
Reference implementation: lucidrains/BS-RoFormer.
Training framework / config: ZFTurbo Music-Source-Separation-Training.
Direct weight source: SYH99999/MelBandRoformerBigSYHFTV1Fast @ 96f4ae8e3f690e51ef26b3bef84531c944f5341b, MIT.

The base checkpoint the upstream fine-tuned from is not documented upstream; we do not assert a chain we cannot verify. This export preserves the upstream MIT license; we do not claim authorship of the original weights.

License & citation

MIT, inherited from the upstream weights.

@article{wang2023melbandroformer,
  title={Mel-Band RoFormer for Music Source Separation},
  author={Wang, Ju-Chiang and Lu, Wei-Tsung and Won, Minz},
  journal={arXiv preprint arXiv:2310.01809},
  year={2023}
}

