Commit e33d0ce · bitsydarel · verified · 1 parent: 8f619dc

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +48 -0
README.md ADDED
@@ -0,0 +1,48 @@
# USEF-TSE ONNX exports (audio-only, 8 kHz)

ONNX exports of the USEF-TSE target speaker extraction models from
[github.com/ZBang/USEF-TSE](https://github.com/ZBang/USEF-TSE).

## License

**CC BY-NC 4.0**, inherited from the upstream weights. Non-commercial use only.

## Models

Two architectures × three training datasets = six exports. All inputs are
8 kHz float32 mono PCM.

| File | Architecture | Training set | Size |
|---|---|---|---|
| `usef_tse_tfgridnet_wsj0-2mix.onnx` | TF-GridNet | WSJ0-2mix (clean) | 60 MB |
| `usef_tse_tfgridnet_wham.onnx` | TF-GridNet | WHAM! (noisy) | 60 MB |
| `usef_tse_tfgridnet_whamr.onnx` | TF-GridNet | WHAMR! (noisy+reverb) | 60 MB |
| `usef_tse_sepformer_wsj0-2mix.onnx` | SepFormer | WSJ0-2mix (clean) | 131 MB |
| `usef_tse_sepformer_wham.onnx` | SepFormer | WHAM! (noisy) | 131 MB |
| `usef_tse_sepformer_whamr.onnx` | SepFormer | WHAMR! (noisy+reverb) | 131 MB |

Per-dataset manifests in `manifest_*.json` carry the SHA-256 checksums of the
exports and the PyTorch ↔ ONNX parity numbers measured on real audio fixtures.

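A downloaded export can be checked against its manifest before use. The sketch below is only illustrative: the manifest schema is not described here, so the layout it reads (a JSON object mapping file names to objects with a `sha256` field) is an assumption, and `matches_manifest` is a hypothetical helper name.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_manifest(model_path, manifest_path, key="sha256"):
    """Compare a model file against the digest recorded in a manifest.

    ASSUMPTION: the manifest is a JSON object keyed by file name, and each
    entry stores its digest under `key`. Adjust this to the real schema.
    """
    manifest = json.loads(Path(manifest_path).read_text())
    expected = manifest[Path(model_path).name][key]
    return sha256_of_file(model_path) == expected.lower()
```

Reading in chunks keeps memory use flat for the 131 MB SepFormer files.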
## Inference contract

- Inputs:
  - `mixture`: `[1, 16000]` float32, 2 seconds at 8 kHz mono
  - `enrollment`: `[1, 64000]` float32, 8 seconds at 8 kHz mono (zero-pad if shorter)
- Output:
  - `extracted`: `[1, 16000]` float32, 2 seconds at 8 kHz mono (same length as mixture)

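The contract above could be met with a small wrapper like the following sketch. It assumes the graph's tensor names match the names in the list (`mixture`, `enrollment`, `extracted`) and that `session` is an `onnxruntime.InferenceSession`; `fit_length` and `extract_window` are hypothetical helper names. Truncating an enrollment longer than 8 seconds is also an assumption, since only zero-padding is specified.

```python
import numpy as np

MIXTURE_SAMPLES = 16000      # 2 s at 8 kHz
ENROLLMENT_SAMPLES = 64000   # 8 s at 8 kHz


def fit_length(audio, target):
    """Zero-pad (or truncate) 1-D audio to a [1, target] float32 array."""
    audio = np.asarray(audio, dtype=np.float32).reshape(-1)
    out = np.zeros((1, target), dtype=np.float32)
    n = min(audio.size, target)
    out[0, :n] = audio[:n]
    return out


def extract_window(session, mixture, enrollment):
    """Run one 2 s mixture window through the model.

    ASSUMPTION: `session` exposes run(output_names, feeds) like
    onnxruntime.InferenceSession, and the tensor names below are correct.
    """
    feeds = {
        "mixture": fit_length(mixture, MIXTURE_SAMPLES),
        "enrollment": fit_length(enrollment, ENROLLMENT_SAMPLES),
    }
    (extracted,) = session.run(["extracted"], feeds)
    return extracted
```

A session would typically be created with `onnxruntime.InferenceSession("usef_tse_tfgridnet_wham.onnx", providers=["CPUExecutionProvider"])`.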
The 2 s mixture window is fixed because TF-GridNet bakes unfold constants in
its ONNX graph. Longer audio must be chunked into 2 s windows and the outputs
concatenated.

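The chunk-and-concatenate step described above can be sketched as follows. `run_window` is a hypothetical callable that wraps the model for a single `[1, 16000]` window; zero-padding the final partial window and trimming the output back to the input length are assumptions, not documented behavior.

```python
import numpy as np

WINDOW = 16000  # 2 s at 8 kHz


def extract_long(mixture, run_window, window=WINDOW):
    """Split 1-D audio into fixed windows, process each, and concatenate.

    `run_window` takes a [1, window] float32 array and returns an array
    with the same number of samples.
    """
    mixture = np.asarray(mixture, dtype=np.float32).reshape(-1)
    total = mixture.size
    pieces = []
    for start in range(0, total, window):
        chunk = np.zeros((1, window), dtype=np.float32)
        part = mixture[start:start + window]
        chunk[0, :part.size] = part          # zero-pad the last partial window
        pieces.append(np.asarray(run_window(chunk)).reshape(-1))
    if not pieces:
        return np.zeros(0, dtype=np.float32)
    return np.concatenate(pieces)[:total]    # drop the padding again
```

Hard window boundaries may produce audible seams; overlap-add across windows is a common refinement but is outside what this README specifies.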
## Exporter

Generated by [`iOS/scripts/export_usef_tse_onnx.py`](https://github.com/bitsydarel/BDAIAssistant) via the legacy
TorchScript exporter at opset 17, with TF-GridNet's `torch.stft`/`torch.istft`
replaced by conv1d/conv_transpose1d-based equivalents (the legacy exporter
rejects complex-typed STFT outputs).

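To illustrate why a convolution can stand in for an STFT, the NumPy sketch below computes the real and imaginary parts as a strided correlation of the signal with fixed, windowed cosine and sine kernels, which is the operation a conv1d with `stride=hop` performs. It is not the exporter's code: `n_fft`, `hop`, and the window are illustrative parameters, and the centering and padding that `torch.stft` applies by default are omitted.

```python
import numpy as np


def dft_kernels(n_fft, window):
    """Real and imaginary DFT basis rows, pre-multiplied by the window."""
    k = np.arange(n_fft // 2 + 1)[:, None]   # frequency bins
    n = np.arange(n_fft)[None, :]            # sample index within a frame
    angle = 2.0 * np.pi * k * n / n_fft
    real = np.cos(angle) * window
    imag = -np.sin(angle) * window
    return real, imag


def conv_stft(signal, n_fft, hop, window):
    """STFT as a strided correlation with fixed kernels.

    Returns (real, imag), each shaped [n_fft // 2 + 1, frames], so no
    complex-typed tensor is ever produced.
    """
    signal = np.asarray(signal, dtype=np.float64)
    real_k, imag_k = dft_kernels(n_fft, window)
    starts = range(0, len(signal) - n_fft + 1, hop)
    frames = np.stack([signal[s:s + n_fft] for s in starts], axis=1)
    return real_k @ frames, imag_k @ frames
```

Keeping real and imaginary parts as separate real-valued channels is what sidesteps the complex-output restriction mentioned above; the inverse transform can be built the same way with a transposed convolution.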
PyTorch ↔ ONNX parity on real 16 kHz audio fixtures (downsampled to 8 kHz for
inference): cosine similarity = 1.000 across all 18 cells; max absolute
difference ≤ 2.2e-3.
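
The two parity metrics quoted above can be computed as in this sketch, assuming the PyTorch and ONNX outputs for a fixture are available as arrays of equal size. `parity_metrics` is a hypothetical helper, not part of the exporter script.

```python
import numpy as np


def parity_metrics(reference, candidate):
    """Return (cosine similarity, max absolute difference) of two outputs."""
    reference = np.asarray(reference, dtype=np.float64).reshape(-1)
    candidate = np.asarray(candidate, dtype=np.float64).reshape(-1)
    if reference.size != candidate.size:
        raise ValueError("outputs must have the same number of samples")
    denom = np.linalg.norm(reference) * np.linalg.norm(candidate)
    # Cosine similarity is undefined when either signal is all zeros.
    cosine = float(reference @ candidate / denom) if denom > 0 else float("nan")
    max_abs = float(np.max(np.abs(reference - candidate)))
    return cosine, max_abs
```

Cosine similarity captures overall waveform agreement, while the maximum absolute difference bounds the worst single-sample deviation, so reporting both guards against a close average hiding a local glitch.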