Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,48 @@
# USEF-TSE ONNX exports (audio-only, 8 kHz)

ONNX exports of the USEF-TSE target speaker extraction models from
[github.com/ZBang/USEF-TSE](https://github.com/ZBang/USEF-TSE).

## License

**CC BY-NC 4.0**, inherited from the upstream weights. Non-commercial use only.

## Models

Two architectures × three training datasets = six exports. All inputs are
8 kHz float32 mono PCM.

| File | Architecture | Training set | Size |
|---|---|---|---|
| `usef_tse_tfgridnet_wsj0-2mix.onnx` | TF-GridNet | WSJ0-2mix (clean) | 60 MB |
| `usef_tse_tfgridnet_wham.onnx` | TF-GridNet | WHAM! (noisy) | 60 MB |
| `usef_tse_tfgridnet_whamr.onnx` | TF-GridNet | WHAMR! (noisy+reverb) | 60 MB |
| `usef_tse_sepformer_wsj0-2mix.onnx` | SepFormer | WSJ0-2mix (clean) | 131 MB |
| `usef_tse_sepformer_wham.onnx` | SepFormer | WHAM! (noisy) | 131 MB |
| `usef_tse_sepformer_whamr.onnx` | SepFormer | WHAMR! (noisy+reverb) | 131 MB |

Per-dataset manifests (`manifest_*.json`) record the SHA-256 checksums and the
PyTorch ↔ ONNX parity numbers measured on real audio fixtures.
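
A downloaded export can be checked against the checksum published in its manifest. The sketch below only computes and compares the digest; the manifest's JSON layout is not described here, so the expected value is assumed to be copied out of the manifest by hand, and the helper names are illustrative.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's digest with an expected hex string (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_expected("usef_tse_tfgridnet_wham.onnx", digest_from_manifest)` should return `True` for an intact download, where `digest_from_manifest` is the value read from the matching manifest.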

## Inference contract

- Inputs:
  - `mixture`: `[1, 16000]` float32, 2 seconds @ 8 kHz mono
  - `enrollment`: `[1, 64000]` float32, 8 seconds @ 8 kHz mono (zero-pad if shorter)
- Output:
  - `extracted`: `[1, 16000]` float32, 2 seconds @ 8 kHz mono (same length as mixture)
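
The contract above can be sketched with NumPy and ONNX Runtime. This is a minimal example under two assumptions: the graph's tensor names are exactly `mixture`, `enrollment`, and `extracted` as listed, and audio longer than the fixed length is truncated (the contract only specifies zero-padding for short enrollment). `fit_length` and `extract` are illustrative names, not part of the exporter.

```python
import numpy as np

SAMPLE_RATE = 8000
MIXTURE_LEN = 2 * SAMPLE_RATE      # 16000 samples
ENROLLMENT_LEN = 8 * SAMPLE_RATE   # 64000 samples


def fit_length(samples, target_len):
    """Zero-pad or truncate 1-D audio; return a [1, target_len] float32 array."""
    samples = np.asarray(samples, dtype=np.float32).reshape(-1)
    out = np.zeros((1, target_len), dtype=np.float32)
    n = min(samples.size, target_len)
    out[0, :n] = samples[:n]
    return out


def extract(model_path, mixture, enrollment):
    """Run one 2 s window through an exported model (requires onnxruntime)."""
    import onnxruntime as ort  # lazy import so fit_length works without it

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    feeds = {
        "mixture": fit_length(mixture, MIXTURE_LEN),
        "enrollment": fit_length(enrollment, ENROLLMENT_LEN),
    }
    (extracted,) = session.run(["extracted"], feeds)
    return extracted  # expected shape: [1, 16000]
```

In a real pipeline the session would be created once and reused across calls; it is built inside `extract` here only to keep the sketch short.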

The 2 s mixture window is fixed because TF-GridNet bakes unfold constants into
its ONNX graph. Longer audio must be split into 2 s windows and the outputs
concatenated.
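
The chunk-and-concatenate step might look like the sketch below. It assumes non-overlapping windows, zero-padding of the final partial window, and trimming the output back to the input length; none of these details are specified above. `extract_window` is a hypothetical callable that maps a `[1, 16000]` float32 array to a `[1, 16000]` float32 array, for example a wrapper around an ONNX Runtime session with a fixed enrollment clip.

```python
import numpy as np

WINDOW = 16000  # 2 s at 8 kHz


def extract_long(mixture, extract_window):
    """Apply extract_window to consecutive 2 s windows and concatenate the results."""
    mixture = np.asarray(mixture, dtype=np.float32).reshape(-1)
    total = mixture.size
    pieces = []
    for start in range(0, total, WINDOW):
        chunk = np.zeros((1, WINDOW), dtype=np.float32)
        segment = mixture[start:start + WINDOW]
        chunk[0, :segment.size] = segment  # zero-pad the final, shorter window
        pieces.append(np.asarray(extract_window(chunk)).reshape(-1))
    if not pieces:
        return np.zeros(0, dtype=np.float32)
    # Drop the samples that correspond to padding in the last window.
    return np.concatenate(pieces)[:total]
```

Hard window boundaries can produce audible discontinuities; overlapping windows with a cross-fade are a common alternative, but that goes beyond what the contract describes.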

## Exporter

Generated by [`iOS/scripts/export_usef_tse_onnx.py`](https://github.com/bitsydarel/BDAIAssistant) via the legacy
TorchScript exporter at opset 17, with TF-GridNet's `torch.stft`/`torch.istft`
replaced by conv1d/conv_transpose1d-based equivalents (the legacy exporter
rejects complex-typed STFT outputs).
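
To illustrate why a convolution can stand in for an STFT, the NumPy sketch below builds one windowed cosine kernel and one windowed sine kernel per frequency bin and applies them to strided frames, keeping real and imaginary parts as separate real-valued outputs. It is not the exporter's code: the function names are hypothetical, and no centering or padding is applied, so it corresponds to an STFT without `center=True`-style padding.

```python
import numpy as np


def dft_kernels(n_fft, window):
    """Build real/imaginary kernels: one windowed cosine and sine per frequency bin."""
    n = np.arange(n_fft)
    k = np.arange(n_fft // 2 + 1)[:, None]
    angle = 2.0 * np.pi * k * n / n_fft
    return np.cos(angle) * window, -np.sin(angle) * window


def conv_stft(signal, n_fft, hop, window):
    """STFT as a strided 1-D convolution; returns (real, imag), each [frames, bins]."""
    real_k, imag_k = dft_kernels(n_fft, window)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] for i in range(n_frames)])
    # Each output value is the dot product of one frame with one kernel, which is
    # what a conv1d with stride=hop and these kernels as weights computes.
    return frames @ real_k.T, frames @ imag_k.T
```

Because the output is a pair of real tensors, no complex-typed value ever appears in the graph, which is the property the legacy exporter needs.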

PyTorch ↔ ONNX parity on real 16 kHz audio fixtures (downsampled to 8 kHz for
inference): cosine similarity = 1.000 across all 18 cells; max absolute
difference ≤ 2.2e-3.
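
The two parity figures can be computed as in the sketch below, assuming both runtimes return arrays of the same shape for the same input. `parity_metrics` is an illustrative helper, not the script that produced the numbers above.

```python
import numpy as np


def parity_metrics(reference, candidate):
    """Return (cosine similarity, max absolute difference) between two outputs."""
    a = np.asarray(reference, dtype=np.float64).reshape(-1)
    b = np.asarray(candidate, dtype=np.float64).reshape(-1)
    cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    max_abs_diff = float(np.max(np.abs(a - b)))
    return cosine, max_abs_diff
```

Cosine similarity is undefined when either output is all zeros, so a silent fixture would need separate handling.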