BS PolarFormer → ONNX Vocal Separation
ONNX conversion of the BS PolarFormer vocal separation model from Music-Source-Separation-Training.
BS PolarFormer is a BSRoformer architecture with PoPE (Polar Positional Embeddings) instead of rotary embeddings. It separates 44.1 kHz stereo audio into vocals and other (instrumental) stems.
Files
| File | Size | Description |
|---|---|---|
| bs_polarformer.onnx | 201 MB | FP32 ONNX model (core: band split → transformers → mask estimator) |
| bs_polarformer_fp16.onnx | 103 MB | FP16 quantized (weights stored as float16, ~same quality) |
| model_bs_polarformer_float16.yaml | 3.6 KB | Model config |
| convert_to_onnx.py | 19 KB | Conversion script (PyTorch → ONNX) |
| run_onnx_inference.py | 7 KB | CLI inference script |
| index.html | 18 KB | Web app (runs in browser via WebGPU/WASM) |
Architecture
The ONNX model contains only the core neural network (51M parameters):
Audio → [STFT] → Core Model (ONNX) → [Mask] → [iSTFT] → Vocals
                  ├─ BandSplit (60 frequency bands)
                  ├─ 12× (TimeTransformer + FreqTransformer)
                  │   └─ 8-head attention, dim=256, PoPE embeddings
                  └─ MaskEstimator (2-layer MLP per band)
STFT/iSTFT are handled outside the ONNX model (in PyTorch or JavaScript).
Input: (batch, time_frames, 4100) — interleaved stereo STFT features (1025 freq × 2 channels × 2 real/imag)
Output: (batch, 1, 2050, time_frames, 2) β complex mask
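As a sketch of how the feature layout maps onto NumPy arrays (the exact interleaving order here is an assumption; run_onnx_inference.py is the authoritative implementation):

```python
import numpy as np

FREQ_BINS = 1025  # n_fft = 2048 at 44.1 kHz -> 1025 bins

def stft_to_features(spec: np.ndarray) -> np.ndarray:
    """Pack a complex stereo STFT, shape (2, 1025, T), into the
    (1, T, 4100) float32 tensor the ONNX model expects.
    Flattening order (freq x channel x real/imag) is assumed."""
    real = spec.real.astype(np.float32)
    imag = spec.imag.astype(np.float32)
    feats = np.stack([real, imag], axis=-1)      # (2, 1025, T, 2)
    feats = feats.transpose(2, 1, 0, 3)          # (T, 1025, 2, 2)
    return feats.reshape(1, feats.shape[0], -1)  # (1, T, 4100)

# Then, with onnxruntime (input/output names are illustrative):
# import onnxruntime as ort
# session = ort.InferenceSession("bs_polarformer.onnx")
# mask = session.run(None, {"features": stft_to_features(spec)})[0]
# mask.shape -> (1, 1, 2050, T, 2), real/imag in the last axis
```

Note that 1025 × 2 × 2 = 4100, matching the input shape above.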
Quality (vs PyTorch reference)
| Metric | FP32 ONNX | FP16 ONNX |
|---|---|---|
| Mask max abs diff | ~1e-7 | ~4e-5 |
| Audio SNR | 107 dB | 48.6 dB |
| Pearson correlation | 1.00000000 | 0.99999642 |
| Model size | 201 MB | 103 MB |
Both are perceptually identical to the PyTorch model. The original model achieves SDR 11.00 on vocals (Multisong Dataset).
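The SNR and correlation figures above can be reproduced with simple NumPy comparisons of the two output waveforms (a minimal sketch, not the project's evaluation script):

```python
import numpy as np

def snr_db(ref: np.ndarray, test: np.ndarray) -> float:
    """Signal-to-noise ratio of `test` against reference `ref`, in dB."""
    noise = ref - test
    return 10.0 * np.log10(np.sum(ref ** 2) / np.sum(noise ** 2))

def pearson(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation between two (flattened) signals."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])
```

Here `ref` would be the PyTorch separation output and `test` the ONNX one, sampled at the same rate and aligned.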
Usage
Python (ONNX Runtime)
pip install onnxruntime librosa soundfile pyyaml einops torch
# Download this repo, then:
python run_onnx_inference.py song.mp3 --output_dir output/
python run_onnx_inference.py song.mp3 --fp16 # use smaller model
Browser (WebGPU)
Serve the files with any HTTP server and open index.html:
python -m http.server 8080
# Open http://localhost:8080
Drop an audio file, select FP32 or FP16, and click "Separate Vocals". Uses WebGPU when available, falls back to WASM.
Convert from scratch
# Download checkpoint
wget https://github.com/ZFTurbo/Music-Source-Separation-Training/releases/download/v1.0.20/model_bs_polarformer_float16.ckpt
# Convert
python convert_to_onnx.py # FP32 only
python convert_to_onnx.py --fp16 # FP32 + FP16
Credits
- Original model & training: ZFTurbo/Music-Source-Separation-Training
- BSRoformer architecture: lucidrains
- PoPE embeddings: PoPE_pytorch