USEF-TSE ONNX exports (audio-only, 8 kHz)
ONNX exports of the USEF-TSE target speaker extraction models from github.com/ZBang/USEF-TSE.
License
CC BY-NC 4.0, inherited from the upstream weights. Non-commercial use only.
Models
Two architectures × three training datasets = six exports. All inputs are 8 kHz float32 mono PCM.
| File | Architecture | Training set | Size |
|---|---|---|---|
| `usef_tse_tfgridnet_wsj0-2mix.onnx` | TF-GridNet | WSJ0-2mix (clean) | 60 MB |
| `usef_tse_tfgridnet_wham.onnx` | TF-GridNet | WHAM! (noisy) | 60 MB |
| `usef_tse_tfgridnet_whamr.onnx` | TF-GridNet | WHAMR! (noisy+reverb) | 60 MB |
| `usef_tse_sepformer_wsj0-2mix.onnx` | SepFormer | WSJ0-2mix (clean) | 131 MB |
| `usef_tse_sepformer_wham.onnx` | SepFormer | WHAM! (noisy) | 131 MB |
| `usef_tse_sepformer_whamr.onnx` | SepFormer | WHAMR! (noisy+reverb) | 131 MB |
Per-dataset manifests in `manifest_*.json` carry the SHA-256 checksums of the exports and the PyTorch-vs-ONNX parity numbers measured on real audio fixtures.
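To check a downloaded export against the checksum in its manifest, something like the following sketch may help. It only computes and compares a digest; the manifest's JSON layout is not documented here, so the expected hex string is passed in by hand rather than parsed, and both function names are hypothetical.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the lowercase hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB blocks so a 131 MB model is never fully in memory.
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's digest with a hex string copied from a manifest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_expected("usef_tse_tfgridnet_wham.onnx", "<hex from manifest>")` returns `True` only when the file is byte-identical to the published export.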
Inference contract
- Inputs:
  - `mixture`: `[1, 16000]` float32, 2 seconds @ 8 kHz mono
  - `enrollment`: `[1, 64000]` float32, 8 seconds @ 8 kHz mono (zero-pad if shorter)
- Output:
  - `extracted`: `[1, 16000]` float32, 2 seconds @ 8 kHz mono (same length as mixture)
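A minimal sketch of feeding this contract through ONNX Runtime is shown below. It assumes the graph's tensor names are exactly `mixture`, `enrollment`, and `extracted` as listed above, and that audio is already 8 kHz mono. The helper names are hypothetical, and truncating an enrollment longer than 8 seconds is an assumption made here, not something the contract specifies.

```python
import numpy as np

MIXTURE_SAMPLES = 16000     # 2 s at 8 kHz
ENROLLMENT_SAMPLES = 64000  # 8 s at 8 kHz


def fit_length(signal, target_len):
    """Zero-pad (or, as an assumption, truncate) 1-D audio to [1, target_len] float32."""
    signal = np.asarray(signal, dtype=np.float32).reshape(-1)
    out = np.zeros(target_len, dtype=np.float32)
    n = min(signal.size, target_len)
    out[:n] = signal[:n]
    return out[np.newaxis, :]


def extract(model_path, mixture, enrollment):
    """Run one 2 s window through an export and return the [1, 16000] output."""
    # Imported lazily so the padding helper works without onnxruntime installed.
    import onnxruntime as ort

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    feeds = {
        "mixture": fit_length(mixture, MIXTURE_SAMPLES),
        "enrollment": fit_length(enrollment, ENROLLMENT_SAMPLES),
    }
    # Tensor names are assumed to match the contract above.
    return session.run(["extracted"], feeds)[0]
```

In real use the `InferenceSession` would be created once and reused across windows; it is built inside `extract` here only to keep the sketch short.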
The 2 s mixture window is fixed because TF-GridNet bakes unfold constants into its ONNX graph. Longer audio must be split into 2 s windows, run one window at a time, and the outputs concatenated.
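The chunking described above could be sketched as follows. The windows are non-overlapping; zero-padding the final partial window and trimming the output back to the original length are assumptions, as is the `run_window` callable, which stands in for whatever performs inference on a single `[1, 16000]` window.

```python
import numpy as np

WINDOW = 16000  # 2 s at 8 kHz


def split_into_windows(audio, window=WINDOW):
    """Split 1-D audio into [n, 1, window] float32 chunks, zero-padding the tail."""
    audio = np.asarray(audio, dtype=np.float32).reshape(-1)
    n_windows = max(1, -(-audio.size // window))  # ceiling division
    padded = np.zeros(n_windows * window, dtype=np.float32)
    padded[: audio.size] = audio
    return padded.reshape(n_windows, 1, window), audio.size


def extract_long(audio, run_window, window=WINDOW):
    """Apply run_window to each chunk and concatenate the results."""
    windows, original_len = split_into_windows(audio, window)
    outputs = [np.asarray(run_window(w)).reshape(-1) for w in windows]
    # Drop the samples that correspond to the zero padding.
    return np.concatenate(outputs)[:original_len]
```

Because each window is processed independently, audible discontinuities at the 2 s boundaries are possible; overlapping windows with a cross-fade would be a different scheme from the plain concatenation described here.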
Exporter
Generated by `iOS/scripts/export_usef_tse_onnx.py` via the legacy
TorchScript exporter at opset 17, with TF-GridNet's `torch.stft`/`torch.istft`
replaced by `conv1d`/`conv_transpose1d`-based equivalents (the legacy exporter
rejects complex-typed STFT outputs).
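The idea behind the STFT replacement can be illustrated in NumPy: a strided correlation with windowed cosine and negative-sine kernels yields the real and imaginary parts of the STFT as two real tensors, so no complex type is needed. This is a sketch of the general technique, not the exporter's code. The `n_fft` and `hop` values are placeholders, the function names are hypothetical, and no centre padding is applied (unlike the `torch.stft` default).

```python
import numpy as np


def dft_kernels(n_fft, window):
    """Windowed DFT basis: one real and one imaginary kernel per frequency bin."""
    n = np.arange(n_fft)
    k = np.arange(n_fft // 2 + 1)[:, None]
    angle = 2.0 * np.pi * k * n / n_fft
    return window * np.cos(angle), -window * np.sin(angle)


def strided_correlate(signal, kernels, hop):
    """Correlate every kernel with the signal at stride `hop`.

    This computes the same sums as a 1-D convolution layer with stride=hop
    and one output channel per kernel; returns [n_frames, n_bins].
    """
    frames = np.lib.stride_tricks.sliding_window_view(signal, kernels.shape[1])[::hop]
    return frames @ kernels.T


def stft_via_conv(signal, n_fft=256, hop=128):
    """Return (real, imag) STFT parts as two real-valued arrays."""
    window = np.hanning(n_fft)
    real_k, imag_k = dft_kernels(n_fft, window)
    return strided_correlate(signal, real_k, hop), strided_correlate(signal, imag_k, hop)
```

The inverse direction works the same way in reverse: a transposed convolution with the synthesis kernels performs the overlap-add that `torch.istft` would otherwise do.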
PyTorch vs. ONNX parity on real 16 kHz audio fixtures (downsampled to 8 kHz for inference): cosine similarity = 1.000 across all 18 cells; max absolute difference ≤ 2.2e-3.
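For readers who want to reproduce this kind of comparison, the two metrics can be computed as in the sketch below. It assumes both outputs are available as arrays of equal size; how the published numbers were actually computed is not stated here, and the function name is hypothetical.

```python
import numpy as np


def parity_metrics(reference, candidate):
    """Return (cosine similarity, max absolute difference) of two waveforms."""
    ref = np.asarray(reference, dtype=np.float64).reshape(-1)
    cand = np.asarray(candidate, dtype=np.float64).reshape(-1)
    cosine = float(np.dot(ref, cand) / (np.linalg.norm(ref) * np.linalg.norm(cand)))
    max_abs_diff = float(np.max(np.abs(ref - cand)))
    return cosine, max_abs_diff
```

Cosine similarity is insensitive to overall scale, so the maximum absolute difference is the figure that bounds sample-level error.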