BS PolarFormer → ONNX Vocal Separation
ONNX conversion of the BS PolarFormer vocal separation model from Music-Source-Separation-Training.
BS PolarFormer is a BSRoformer architecture with PoPE (Polar Positional Embeddings) instead of rotary embeddings. It separates 44.1 kHz stereo audio into vocals and other (instrumental) stems.
Files
| File | Size | Description |
|---|---|---|
| bs_polarformer.onnx | 201 MB | FP32 ONNX model (core: band split → transformers → mask estimator) |
| bs_polarformer_fp16.onnx | 103 MB | FP16 quantized (weights stored as float16, ~same quality) |
| model_bs_polarformer_float16.yaml | 3.6 KB | Model config |
| convert_to_onnx.py | 19 KB | Conversion script (PyTorch → ONNX) |
| run_onnx_inference.py | 7 KB | CLI inference script |
| index.html | 18 KB | Web app (runs in browser via WebGPU/WASM) |
Architecture
The ONNX model contains only the core neural network (51M parameters):
Audio → [STFT] → Core Model (ONNX) → [Mask] → [iSTFT] → Vocals
                  ├─ BandSplit (60 frequency bands)
                  ├─ 12× (TimeTransformer + FreqTransformer)
                  │   └─ 8-head attention, dim=256, PoPE embeddings
                  └─ MaskEstimator (2-layer MLP per band)
STFT/iSTFT are handled outside the ONNX model (in PyTorch or JavaScript).
Input: (batch, time_frames, 4100) — interleaved stereo STFT features (1025 freq × 2 channels × 2 real/imag)
Output: (batch, 1, 2050, time_frames, 2) β complex mask
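As a sketch of how the feature layout maps onto NumPy arrays (the exact interleaving order here is an assumption; run_onnx_inference.py is the authoritative implementation):

```python
import numpy as np

FREQ_BINS = 1025  # n_fft = 2048 at 44.1 kHz -> 1025 bins

def stft_to_features(spec: np.ndarray) -> np.ndarray:
    """Pack a complex stereo STFT, shape (2, 1025, T), into the
    (1, T, 4100) float32 tensor the ONNX model expects.
    Flattening order (freq x channel x real/imag) is assumed."""
    real = spec.real.astype(np.float32)
    imag = spec.imag.astype(np.float32)
    feats = np.stack([real, imag], axis=-1)      # (2, 1025, T, 2)
    feats = feats.transpose(2, 1, 0, 3)          # (T, 1025, 2, 2)
    return feats.reshape(1, feats.shape[0], -1)  # (1, T, 4100)

# Then, with onnxruntime (input/output names are illustrative):
# import onnxruntime as ort
# session = ort.InferenceSession("bs_polarformer.onnx")
# mask = session.run(None, {"features": stft_to_features(spec)})[0]
# mask.shape -> (1, 1, 2050, T, 2), real/imag in the last axis
```

Note that 1025 × 2 × 2 = 4100, matching the input shape above.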
Quality (vs PyTorch reference)
| Metric | FP32 ONNX | FP16 ONNX |
|---|---|---|
| Mask max abs diff | ~1e-7 | ~4e-5 |
| Audio SNR | 107 dB | 48.6 dB |
| Pearson correlation | 1.00000000 | 0.99999642 |
| Model size | 201 MB | 103 MB |
Both are perceptually identical to the PyTorch model. The original model achieves SDR 11.00 on vocals (Multisong Dataset).
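The SNR and correlation figures above can be reproduced with simple NumPy comparisons of the two output waveforms (a minimal sketch, not the project's evaluation script):

```python
import numpy as np

def snr_db(ref: np.ndarray, test: np.ndarray) -> float:
    """Signal-to-noise ratio of `test` against reference `ref`, in dB."""
    noise = ref - test
    return 10.0 * np.log10(np.sum(ref ** 2) / np.sum(noise ** 2))

def pearson(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation between two (flattened) signals."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])
```

Here `ref` would be the PyTorch separation output and `test` the ONNX one, sampled at the same rate and aligned.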
Usage
Python (ONNX Runtime)
pip install onnxruntime librosa soundfile pyyaml einops torch
# Download this repo, then:
python run_onnx_inference.py song.mp3 --output_dir output/
python run_onnx_inference.py song.mp3 --fp16 # use smaller model
Browser (WebGPU)
Serve the files with any HTTP server and open index.html:
python -m http.server 8080
# Open http://localhost:8080
Drop an audio file, select FP32 or FP16, and click "Separate Vocals". Uses WebGPU when available, falls back to WASM.
Convert from scratch
# Download checkpoint
wget https://github.com/ZFTurbo/Music-Source-Separation-Training/releases/download/v1.0.20/model_bs_polarformer_float16.ckpt
# Convert
python convert_to_onnx.py # FP32 only
python convert_to_onnx.py --fp16 # FP32 + FP16
Credits
- Original model & training: ZFTurbo/Music-Source-Separation-Training
- BSRoformer architecture: lucidrains
- PoPE embeddings: PoPE_pytorch