BS PolarFormer – ONNX Vocal Separation

ONNX conversion of the BS PolarFormer vocal separation model from Music-Source-Separation-Training.

BS PolarFormer is a BSRoformer-style architecture that replaces rotary embeddings with PoPE (Polar Positional Embeddings). It separates stereo 44.1 kHz audio into vocals and other (instrumental) stems.

Files

| File | Size | Description |
|---|---|---|
| `bs_polarformer.onnx` | 201 MB | FP32 ONNX model (core: band split → transformers → mask estimator) |
| `bs_polarformer_fp16.onnx` | 103 MB | FP16 model (weights stored as float16, ~same quality) |
| `model_bs_polarformer_float16.yaml` | 3.6 KB | Model config |
| `convert_to_onnx.py` | 19 KB | Conversion script (PyTorch → ONNX) |
| `run_onnx_inference.py` | 7 KB | CLI inference script |
| `index.html` | 18 KB | Web app (runs in browser via WebGPU/WASM) |

Architecture

The ONNX model contains only the core neural network (51M parameters):

```
Audio → [STFT] → Core Model (ONNX) → [Mask] → [iSTFT] → Vocals
                  ├─ BandSplit (60 frequency bands)
                  ├─ 12× (TimeTransformer + FreqTransformer)
                  │   └─ 8-head attention, dim=256, PoPE embeddings
                  └─ MaskEstimator (2-layer MLP per band)
```
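To make the band-split step concrete, here is a toy NumPy sketch of the idea: group the frequency bins into 60 bands and project each band to the shared model dimension. The band widths and random projection weights below are purely illustrative; the real boundaries come from the model config, and the real module operates on the interleaved complex stereo features rather than raw bins.

```python
import numpy as np

def band_split(frames, bands, dim=256, rng=np.random.default_rng(0)):
    # frames: (T, n_bins) real features; bands: list of bin counts summing to n_bins.
    # Each band gets its own linear projection to the shared model dim.
    out, start = [], 0
    for width in bands:
        w = rng.standard_normal((width, dim)) * 0.02   # illustrative per-band weights
        out.append(frames[:, start:start + width] @ w)  # (T, dim)
        start += width
    return np.stack(out, axis=1)                        # (T, n_bands, dim)

bands = [17] * 60
bands[-1] += 1025 - sum(bands)                          # toy split covering all 1025 bins
x = band_split(np.random.randn(8, 1025), bands)
print(x.shape)  # (8, 60, 256)
```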

STFT/iSTFT are handled outside the ONNX model (in PyTorch or JavaScript).
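Applying the predicted mask outside the graph amounts to an element-wise complex multiply before the iSTFT. A minimal NumPy sketch, assuming the 2050-bin axis stacks the two stereo channels' 1025 bins (run_onnx_inference.py defines the actual layout):

```python
import numpy as np

def apply_mask(mix_spec, mask):
    # mix_spec: complex (2, 1025, T) stereo mixture STFT
    # mask: (1, 1, 2050, T, 2) model output, last axis = (real, imag)
    m = mask[0, 0, :, :, 0] + 1j * mask[0, 0, :, :, 1]  # (2050, T) complex mask
    m = m.reshape(2, 1025, -1)                          # assumed channel stacking
    return m * mix_spec                                 # element-wise complex product

mix = np.random.randn(2, 1025, 50) + 1j * np.random.randn(2, 1025, 50)
mask = np.random.randn(1, 1, 2050, 50, 2)
vocals_spec = apply_mask(mix, mask)
print(vocals_spec.shape)  # (2, 1025, 50)
```

The masked spectrogram is then passed through the iSTFT to recover the vocal waveform.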

Input: `(batch, time_frames, 4100)` — interleaved stereo STFT features (1025 freq × 2 channels × 2 real/imag)

Output: `(batch, 1, 2050, time_frames, 2)` — complex mask
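The input tensor can be assembled from a stereo complex STFT with plain NumPy. A minimal sketch; the interleaving order used here (real/imag fastest, then channel, then frequency) is an assumption for illustration, and the authoritative layout lives in convert_to_onnx.py:

```python
import numpy as np

def pack_features(spec):
    # spec: complex (2 channels, 1025 bins, T frames) stereo STFT
    ch, f, t = spec.shape
    ri = np.stack([spec.real, spec.imag], axis=-1)           # (2, 1025, T, 2)
    # Reorder to (T, freq, channel, real/imag) and flatten the
    # last three axes into the 4100-wide feature dimension.
    feats = ri.transpose(2, 1, 0, 3).reshape(t, f * ch * 2)  # (T, 4100)
    return feats[None]                                       # (1, T, 4100)

spec = (np.random.randn(2, 1025, 100) + 1j * np.random.randn(2, 1025, 100)).astype(np.complex64)
x = pack_features(spec)
print(x.shape)  # (1, 100, 4100)
```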

Quality (vs PyTorch reference)

| Metric | FP32 ONNX | FP16 ONNX |
|---|---|---|
| Mask max abs diff | ~1e-7 | ~4e-5 |
| Audio SNR | 107 dB | 48.6 dB |
| Pearson correlation | 1.00000000 | 0.99999642 |
| Model size | 201 MB | 103 MB |

Both are perceptually identical to the PyTorch model. The original model achieves SDR 11.00 on vocals (Multisong Dataset).
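The SNR figures above follow the usual definition (signal power over error power, in dB); a quick sketch of how such a number is computed:

```python
import numpy as np

def snr_db(ref, test):
    # SNR of `test` against reference `ref`: 10*log10(signal power / error power)
    noise = ref - test
    return 10 * np.log10(np.sum(ref ** 2) / np.sum(noise ** 2))

ref = np.sin(np.linspace(0, 100, 44100))
approx = ref + 1e-4 * np.random.default_rng(0).standard_normal(ref.size)
print(snr_db(ref, approx))  # roughly 77 dB for 1e-4-scale noise on a unit sine
```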

Usage

Python (ONNX Runtime)

```
pip install onnxruntime librosa soundfile pyyaml einops torch
```

```
# Download this repo, then:
python run_onnx_inference.py song.mp3 --output_dir output/
python run_onnx_inference.py song.mp3 --fp16  # use smaller model
```
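Long tracks are typically processed in overlapping chunks whose seams are crossfaded. A generic sketch of that pattern with an identity stand-in for the separation model; the chunk and overlap sizes here are illustrative, not necessarily the values run_onnx_inference.py uses:

```python
import numpy as np

def process_chunked(audio, model, chunk=44100 * 8, overlap=44100):
    # Overlap-add chunking: process fixed windows, crossfade the seams,
    # then normalize by the accumulated window weight.
    hop = chunk - overlap
    out = np.zeros_like(audio)
    weight = np.zeros_like(audio)
    fade = np.ones(chunk)
    fade[:overlap] = np.linspace(0, 1, overlap)    # fade-in
    fade[-overlap:] = np.linspace(1, 0, overlap)   # fade-out
    for start in range(0, len(audio), hop):
        seg = audio[start:start + chunk]
        w = fade[:len(seg)]
        out[start:start + chunk] += model(seg) * w
        weight[start:start + chunk] += w
    return out / np.maximum(weight, 1e-8)

x = np.random.randn(44100 * 20)
y = process_chunked(x, lambda s: s)  # identity "model" reconstructs the input
```

With a real model, `lambda s: s` is replaced by STFT, ONNX inference, masking, and iSTFT on each chunk.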

Browser (WebGPU)

Serve the files with any HTTP server and open index.html:

```
python -m http.server 8080
# Open http://localhost:8080
```

Drop an audio file, select FP32 or FP16, and click "Separate Vocals". Uses WebGPU when available, falls back to WASM.

Convert from scratch

```
# Download checkpoint
wget https://github.com/ZFTurbo/Music-Source-Separation-Training/releases/download/v1.0.20/model_bs_polarformer_float16.ckpt

# Convert
python convert_to_onnx.py          # FP32 only
python convert_to_onnx.py --fp16   # FP32 + FP16
```

Credits
