bitsydarel/usef-tse-onnx


bitsydarel committed 22 days ago

Commit e33d0ce (verified) · 1 parent: 8f619dc

Upload README.md with huggingface_hub

Files changed (1)

README.md +48 -0

README.md ADDED

	@@ -0,0 +1,48 @@

+# USEF-TSE ONNX exports (audio-only, 8 kHz)
+ONNX exports of the USEF-TSE target speaker extraction models from
+[github.com/ZBang/USEF-TSE](https://github.com/ZBang/USEF-TSE).
+## License
+**CC BY-NC 4.0** — inherited from the upstream weights. Non-commercial use only.
+## Models
+Two architectures × three training datasets = six exports. All inputs are
+8 kHz float32 mono PCM.
+| File | Architecture | Training set | Size |
+|---|---|---|---|
+| `usef_tse_tfgridnet_wsj0-2mix.onnx` | TF-GridNet | WSJ0-2mix (clean) | 60 MB |
+| `usef_tse_tfgridnet_wham.onnx`      | TF-GridNet | WHAM! (noisy)     | 60 MB |
+| `usef_tse_tfgridnet_whamr.onnx`     | TF-GridNet | WHAMR! (noisy+reverb) | 60 MB |
+| `usef_tse_sepformer_wsj0-2mix.onnx` | SepFormer  | WSJ0-2mix (clean) | 131 MB |
+| `usef_tse_sepformer_wham.onnx`      | SepFormer  | WHAM! (noisy)     | 131 MB |
+| `usef_tse_sepformer_whamr.onnx`     | SepFormer  | WHAMR! (noisy+reverb) | 131 MB |
+Per-dataset manifests (`manifest_*.json`) record SHA-256 checksums and the
+PyTorch ↔ ONNX parity numbers measured on real audio fixtures.
+## Inference contract
+- Inputs:
+  - `mixture`: `[1, 16000]` float32 — 2 seconds @ 8 kHz mono
+  - `enrollment`: `[1, 64000]` float32 — 8 seconds @ 8 kHz mono (zero-pad if shorter)
+- Output:
+  - `extracted`: `[1, 16000]` float32 — 2 seconds @ 8 kHz mono (same length as mixture)
+The 2 s mixture window is fixed because the TF-GridNet export bakes unfold
+constants into its ONNX graph. Longer audio must be split into 2 s windows and
+the outputs concatenated.
+## Exporter
+Generated by [`iOS/scripts/export_usef_tse_onnx.py`](https://github.com/bitsydarel/BDAIAssistant) using the legacy
+TorchScript exporter at opset 17. TF-GridNet's `torch.stft`/`torch.istft` calls
+are replaced by conv1d/conv_transpose1d-based equivalents because the legacy
+exporter rejects complex-typed STFT outputs.
+PyTorch ↔ ONNX parity on real 16 kHz audio fixtures (downsampled to 8 kHz for
+inference): cosine similarity = 1.000 across all 18 cells; max absolute
+difference ≤ 2.2e-3.
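
The inference contract in the README above fixes the mixture at 2 s (16000 samples) and the enrollment at 8 s (64000 samples), so longer recordings need a chunking loop around the model. Below is a minimal Python sketch of that loop, not the repository's own code. The names `fit_enrollment`, `extract_long`, and `run_model` are hypothetical. Truncating an over-long enrollment clip and zero-padding the final short mixture window are assumptions the README does not state.

```python
from typing import Callable

import numpy as np

MIXTURE_LEN = 16_000     # 2 s @ 8 kHz, fixed by the exported graph
ENROLLMENT_LEN = 64_000  # 8 s @ 8 kHz

# Hypothetical callable: (mixture [1, 16000], enrollment [1, 64000]) -> extracted [1, 16000]
Extractor = Callable[[np.ndarray, np.ndarray], np.ndarray]


def fit_enrollment(enrollment: np.ndarray) -> np.ndarray:
    """Zero-pad a 1-D enrollment clip to 8 s and return shape [1, 64000]."""
    # Assumption: clips longer than 8 s are truncated; the README only covers padding.
    clip = np.asarray(enrollment, dtype=np.float32)[:ENROLLMENT_LEN]
    padded = np.zeros(ENROLLMENT_LEN, dtype=np.float32)
    padded[: clip.shape[0]] = clip
    return padded[np.newaxis, :]


def extract_long(mixture: np.ndarray, enrollment: np.ndarray, run_model: Extractor) -> np.ndarray:
    """Run a fixed 2 s extractor over a 1-D mixture of any length."""
    mixture = np.asarray(mixture, dtype=np.float32)
    enroll = fit_enrollment(enrollment)
    outputs = []
    for start in range(0, mixture.shape[0], MIXTURE_LEN):
        chunk = mixture[start : start + MIXTURE_LEN]
        window = np.zeros(MIXTURE_LEN, dtype=np.float32)
        window[: chunk.shape[0]] = chunk  # assumption: zero-pad the last short window
        outputs.append(run_model(window[np.newaxis, :], enroll)[0])
    if not outputs:
        return np.zeros(0, dtype=np.float32)
    # Concatenate the per-window outputs and drop the samples added as padding.
    return np.concatenate(outputs)[: mixture.shape[0]]
```

With ONNX Runtime, `run_model` could wrap an `onnxruntime.InferenceSession`, for example `lambda m, e: session.run(["extracted"], {"mixture": m, "enrollment": e})[0]`, assuming the graph's tensor names match the names listed in the contract. Plain concatenation can leave audible seams at window boundaries; overlapping windows with a cross-fade is a common alternative, but the README does not describe one.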