+---
+license: mit
+tags:
+  - music-source-separation
+  - vocal-separation
+  - onnx
+  - webgpu
+  - audio
+pipeline_tag: audio-to-audio
+library_name: onnxruntime
+base_model: ZFTurbo/Music-Source-Separation-Training
+---
+# BS PolarFormer – ONNX Vocal Separation
+ONNX conversion of the **BS PolarFormer** vocal separation model from
+[Music-Source-Separation-Training](https://github.com/ZFTurbo/Music-Source-Separation-Training).
+BS PolarFormer is a BSRoformer variant that uses **PoPE** (Polar Positional Embeddings)
+instead of rotary embeddings. It separates **vocals** from **other** (the instrumental) in stereo audio sampled at 44.1 kHz.
+## Files
+| File | Size | Description |
+|------|------|-------------|
+| `bs_polarformer.onnx` | 201 MB | FP32 ONNX model (core: band split → transformers → mask estimator) |
+| `bs_polarformer_fp16.onnx` | 103 MB | FP16 variant (weights stored as float16, near-identical quality) |
+| `model_bs_polarformer_float16.yaml` | 3.6 KB | Model config |
+| `convert_to_onnx.py` | 19 KB | Conversion script (PyTorch → ONNX) |
+| `run_onnx_inference.py` | 7 KB | CLI inference script |
+| `index.html` | 18 KB | Web app (runs in browser via WebGPU/WASM) |
+## Architecture
+The ONNX model contains only the **core neural network** (51M parameters):
+```
+Audio → [STFT] → Core Model (ONNX) → [Mask] → [iSTFT] → Vocals
+                  ├─ BandSplit (60 frequency bands)
+                  ├─ 12× (TimeTransformer + FreqTransformer)
+                  │   └─ 8-head attention, dim=256, PoPE embeddings
+                  └─ MaskEstimator (2-layer MLP per band)
+```
+STFT and iSTFT are handled outside the ONNX model (in PyTorch or JavaScript).
+
+- **Input:** `(batch, time_frames, 4100)`: interleaved stereo STFT features (1025 freq bins × 2 channels × 2 real/imag)
+- **Output:** `(batch, 1, 2050, time_frames, 2)`: complex mask (2050 = 1025 freq bins × 2 channels; the last axis holds real/imag)
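The shapes above can be illustrated with a small NumPy sketch that packs a complex stereo STFT into the 4100-wide feature tensor and applies a returned mask. The helper names `pack_stft` and `apply_mask` are hypothetical, and the flattening order (frequency, then channel, then real/imag) is an assumption based on the upstream BSRoformer layout; check it against `run_onnx_inference.py` before relying on it.

```python
import numpy as np


def pack_stft(stft):
    """Pack a complex STFT (channels, freq, frames) into (1, frames, freq*channels*2).

    Assumed flattening order: frequency, then channel, then real/imag.
    """
    ch, freq, frames = stft.shape
    x = np.transpose(stft, (2, 1, 0))            # (frames, freq, ch)
    x = np.stack([x.real, x.imag], axis=-1)      # (frames, freq, ch, 2)
    return x.reshape(1, frames, freq * ch * 2).astype(np.float32)


def apply_mask(stft, mask):
    """Multiply a complex STFT (channels, freq, frames) by a mask of shape
    (1, 1, freq*channels, frames, 2), using the same assumed ordering."""
    ch, freq, frames = stft.shape
    m = mask[0, 0, :, :, 0] + 1j * mask[0, 0, :, :, 1]   # (freq*ch, frames)
    m = m.reshape(freq, ch, frames).transpose(1, 0, 2)   # (ch, freq, frames)
    return stft * m
```

With 1025 frequency bins and 2 channels, `pack_stft` yields a last axis of 1025 × 2 × 2 = 4100, matching the input shape stated above. The masked STFT would then go through an iSTFT to recover the vocal waveform.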
+## Quality (vs PyTorch reference)
+| Metric | FP32 ONNX | FP16 ONNX |
+|---|---|---|
+| Mask max abs diff | ~1e-7 | ~4e-5 |
+| Audio SNR | 107 dB | 48.6 dB |
+| Pearson correlation | 1.00000000 | 0.99999642 |
+| Model size | 201 MB | 103 MB |
+
+Both variants are perceptually indistinguishable from the PyTorch model. The original model achieves **SDR 11.00** on vocals (Multisong Dataset).
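The source does not say how the table's metrics were computed. As a rough sketch, assuming the conventional definitions (SNR as signal energy over error energy, Pearson correlation over flattened samples), they could be reproduced as follows. The function names are hypothetical and not part of this repo.

```python
import numpy as np


def max_abs_diff(reference, estimate):
    """Largest element-wise absolute difference between two arrays."""
    return float(np.max(np.abs(np.asarray(reference) - np.asarray(estimate))))


def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB, treating (reference - estimate) as noise."""
    reference = np.asarray(reference, dtype=np.float64)
    noise = reference - np.asarray(estimate, dtype=np.float64)
    return float(10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2)))


def pearson(reference, estimate):
    """Pearson correlation coefficient over all samples."""
    return float(np.corrcoef(np.ravel(reference), np.ravel(estimate))[0, 1])
```

Under this definition, an SNR of 48.6 dB corresponds to an error amplitude of about 0.4% of the signal (10^(-48.6/20) ≈ 0.0037).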
+## Usage
+### Python (ONNX Runtime)
+```bash
+pip install onnxruntime librosa soundfile pyyaml einops torch
+# Download this repo, then:
+python run_onnx_inference.py song.mp3 --output_dir output/
+python run_onnx_inference.py song.mp3 --fp16  # use smaller model
+```
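To call the core model from your own code instead of the CLI script, a minimal ONNX Runtime sketch might look like the one below. It assumes the `(batch, time_frames, 4100)` feature tensor has already been built from an STFT. `load_session` and `run_core` are hypothetical helper names, and picking the input dtype from the session's reported element type is an assumption about how the FP16 file was exported.

```python
import numpy as np


def load_session(model_path="bs_polarformer.onnx"):
    """Create a CPU inference session for one of the ONNX files."""
    # Imported here so run_core can be exercised without onnxruntime installed.
    import onnxruntime as ort
    return ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])


def run_core(session, features):
    """Run the core model on features (batch, time_frames, 4100); return the mask."""
    inp = session.get_inputs()[0]
    # Assumption: feed float16 only if the model declares a float16 input.
    dtype = np.float16 if "float16" in inp.type else np.float32
    outputs = session.run(None, {inp.name: np.asarray(features, dtype=dtype)})
    return np.asarray(outputs[0], dtype=np.float32)
```

A typical call would be `mask = run_core(load_session(), features)`; long tracks would normally be processed in chunks, which this sketch leaves out.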
+### Browser (WebGPU)
+Serve the files with any HTTP server and open `index.html`:
+```bash
+python -m http.server 8080
+# Open http://localhost:8080
+```
+Drop an audio file, select FP32 or FP16, and click "Separate Vocals". The app uses WebGPU when available and falls back to WASM otherwise.
+### Convert from scratch
+```bash
+# Download checkpoint
+wget https://github.com/ZFTurbo/Music-Source-Separation-Training/releases/download/v1.0.20/model_bs_polarformer_float16.ckpt
+# Convert
+python convert_to_onnx.py          # FP32 only
+python convert_to_onnx.py --fp16   # FP32 + FP16
+```
+## Credits
+- Original model & training: [ZFTurbo/Music-Source-Separation-Training](https://github.com/ZFTurbo/Music-Source-Separation-Training)
+- BSRoformer architecture: [lucidrains](https://github.com/lucidrains)
+- PoPE embeddings: [PoPE_pytorch](https://pypi.org/project/PoPE-pytorch/)