Add MLX port of MERL MRX (default_ checkpoint, fp32) — 3-stem soundtrack separation

Files changed (3)

README.md +53 -0
config.json +23 -0
model.safetensors +3 -0

README.md ADDED

	@@ -0,0 +1,53 @@

+---
+license: mit
+library_name: mlx
+tags:
+  - mlx
+  - audio
+  - audio-source-separation
+  - speech
+  - music
+  - apple-silicon
+pipeline_tag: audio-to-audio
+---
+# Cocktail-Fork-MRX (MLX)
+Apple **MLX** port of MERL's **MRX** (Multi-Resolution CrossNet) — separates a
+soundtrack mixture into three stems: **music**, **speech**, and **sound effects (sfx)**.
+Runs natively on Apple Silicon, with no PyTorch dependency at inference time.
+- **Upstream:** [merlresearch/cocktail-fork-separation](https://github.com/merlresearch/cocktail-fork-separation) — *The Cocktail Fork Problem: Three-Stem Audio Separation for Real-World Soundtracks* (ICASSP 2022).
+- **Checkpoint:** `default_` (SNR-loss trained — the upstream default inference weights).
+- **License:** MIT.
+- **Parity:** matches the PyTorch reference to fp32 precision (full-pipeline max_abs ≈ `9e-8`; per-stem SI-SDR of 107–139 dB measured against the torch output).
+## Usage
+```bash
+pip install cocktail-fork-mlx   # or: pip install git+https://github.com/xocialize/cocktail-fork-mlx
+cocktail-fork-mlx --audio-path soundtrack.wav --out-dir ./out
+# -> out/music.wav  out/speech.wav  out/sfx.wav
+```
+```python
+import mlx.core as mx
+import numpy as np
+import soundfile as sf
+from cocktail_fork_mlx.separate import separate_soundtrack
+from cocktail_fork_mlx.weights import from_pretrained
+
+# Read as (samples, channels); the model expects 44.1 kHz input.
+audio, fs = sf.read("soundtrack.wav", always_2d=True)
+model = from_pretrained("mlx-community/Cocktail-Fork-MRX")
+# Transpose to (channels, samples) and cast to float32 for MLX.
+stems = separate_soundtrack(mx.array(audio.T.astype("float32")), model)
+for name, x in stems.items():
+    # Write each stem back as (samples, channels) at the input sample rate.
+    sf.write(f"{name}.wav", np.array(x).T, fs)
+```
+## Model
+- 44.1 kHz, any channel count. ~30.6M params, fp32 (122 MB).
+- Multi-resolution STFT (windows 1024/2048/8192, hop 256) → per-resolution magnitude
+  encoders → 3 parallel bidirectional CrossNet LSTMs → per-source/per-resolution mask
+  decoders → masked iSTFT summed across resolutions.
+- CPU is faster than the GPU for this LSTM-bound model, so the CLI defaults to CPU.
+Ported by MVS Collective (xocialize). Released under MIT; original model and weights © MERL.
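The Parity bullet in the README quotes two figures, a maximum absolute difference and a per-stem SI-SDR. As a rough illustration of what those numbers measure, here is a minimal NumPy sketch using the standard scale-invariant SDR definition. The function names `max_abs_diff` and `si_sdr_db` are hypothetical, the signals are synthetic stand-ins, and this is not necessarily how the port's own parity check is written.

```python
import numpy as np

def max_abs_diff(ref, est):
    """Largest element-wise absolute difference between two arrays."""
    ref = np.asarray(ref, dtype=np.float64)
    est = np.asarray(est, dtype=np.float64)
    return float(np.max(np.abs(ref - est)))

def si_sdr_db(ref, est, eps=1e-12):
    """Scale-invariant SDR in dB of `est` measured against `ref`."""
    ref = np.asarray(ref, dtype=np.float64).ravel()
    est = np.asarray(est, dtype=np.float64).ravel()
    # Project the estimate onto the reference to remove any gain mismatch.
    scale = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = scale * ref
    noise = est - target
    return float(10.0 * np.log10((np.dot(target, target) + eps)
                                 / (np.dot(noise, noise) + eps)))

# Synthetic stand-ins for a reference stem and a ported stem (not real model output).
rng = np.random.default_rng(0)
ref = rng.standard_normal(44100)
est = ref + 1e-3 * rng.standard_normal(44100)

print("max_abs:", max_abs_diff(ref, est))
print("SI-SDR (dB):", si_sdr_db(ref, est))
```

Every 20 dB of SI-SDR corresponds to a tenfold drop in error amplitude relative to the reference, so values above 100 dB indicate differences at the level of floating-point rounding.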

config.json ADDED

	@@ -0,0 +1,23 @@

+{
+  "model_type": "mrx",
+  "architecture": "MRX (Multi-Resolution CrossNet)",
+  "n_sources": 3,
+  "window_lengths": [
+    1024,
+    2048,
+    8192
+  ],
+  "n_hop": 256,
+  "n_hidden": 512,
+  "n_lstm_layers": 3,
+  "sample_rate": 44100,
+  "source_names": [
+    "music",
+    "speech",
+    "sfx"
+  ],
+  "upstream": "merlresearch/cocktail-fork-separation",
+  "license": "MIT",
+  "port_version": "0.1.0",
+  "dtype": "float32"
+}
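To make the STFT settings in this config more concrete, the sketch below converts `window_lengths` and `n_hop` into milliseconds at the configured sample rate and counts frames for a 10-second clip. The frame count assumes a centered STFT convention (`1 + n_samples // n_hop`), which is an assumption about the implementation rather than something the config states; the helper names are hypothetical. Under that convention the frame count depends only on the hop, so all three resolutions stay time-aligned.

```python
import json

# Subset of the config.json fields shown above.
CONFIG_JSON = """
{
  "window_lengths": [1024, 2048, 8192],
  "n_hop": 256,
  "sample_rate": 44100
}
"""

def window_durations_ms(cfg):
    """Duration of each analysis window in milliseconds."""
    sr = cfg["sample_rate"]
    return [1000.0 * w / sr for w in cfg["window_lengths"]]

def n_frames_centered(n_samples, n_hop):
    """Frame count for a centered STFT (assumed convention)."""
    return 1 + n_samples // n_hop

cfg = json.loads(CONFIG_JSON)
durations = window_durations_ms(cfg)
hop_ms = 1000.0 * cfg["n_hop"] / cfg["sample_rate"]
frames = n_frames_centered(10 * cfg["sample_rate"], cfg["n_hop"])

for w, d in zip(cfg["window_lengths"], durations):
    print(f"window {w}: {d:.1f} ms")
print(f"hop: {hop_ms:.1f} ms, frames in 10 s: {frames}")
```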

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:17244f6f1ded8b3430a757a2d2a72bdf2e88eecd035f81a54ebb74f6d5f79884
+size 122284399
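The three lines above are a Git LFS pointer, not the weights themselves. A minimal sketch of how a downloaded `model.safetensors` could be checked against it follows; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, and the local file path is whatever the reader has downloaded. The last line also sanity-checks the README's parameter count: with fp32 weights, size / 4 is a slight upper bound on the number of parameters, since a safetensors file also carries a small JSON header.

```python
import hashlib

POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:17244f6f1ded8b3430a757a2d2a72bdf2e88eecd035f81a54ebb74f6d5f79884
size 122284399
"""

def parse_lfs_pointer(text):
    """Parse 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }

def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and hash."""
    h = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as f:
        # Stream in chunks so a ~122 MB file is never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and h.hexdigest() == pointer["digest"]

pointer = parse_lfs_pointer(POINTER_TEXT)
# fp32 uses 4 bytes per parameter; the result is in millions of parameters.
approx_params_m = pointer["size"] / 4 / 1e6
print(f"upper bound on parameters: {approx_params_m:.2f}M")
```

Calling `matches_pointer("model.safetensors", pointer)` on a local copy would then confirm that the download is complete and unmodified.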